O'REILLY

Hands-On Machine Learning with Scikit-Learn

CONCEPTS, TOOLS, AND TECHNIQUES TO BUILD INTELLIGENT SYSTEMS

Aurélien Géron

based on "Hands-On Machine Learning with Scikit-Learn & TensorFlow" (O'Reilly, Aurélien Géron)


Book chapters

1) Intro to Machine Learning
2) Example end-to-end Machine Learning project (California Housing dataset)
3) Basic Classification
4) Training Techniques
5) Support Vector Machines
6) Decision Trees
7) Ensemble Learning & Random Forests
8) Dimensionality Reduction
9) TensorFlow Installation & Checkout
10) TensorFlow & Neural Nets
11) TensorFlow Training
12) TensorFlow on Distributed Hardware
13) Convolutional Neural Nets
14) Recurrent Neural Nets
15) Autoencoders
16) Reinforcement Learning


Get Dataset & Create Workspace 


import os 
import tarfile 
from six.moves import urllib 


import pandas as pd 
import numpy as np 


DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"


def fetch_housing_data(
        housing_url=HOUSING_URL,
        housing_path=HOUSING_PATH):

    # create datasets/housing directory if needed
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)

    tgz_path = os.path.join(housing_path, "housing.tgz")

    # retrieve tarfile
    urllib.request.urlretrieve(housing_url, tgz_path)

    # extract tarfile & close it
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()


def load_housing_data(
        housing_path=HOUSING_PATH):

    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


# let's do it
# fetch_housing_data()   # already downloaded - static dataset

housing = load_housing_data()


Data structure - quick peek


housing.head() 


   longitude  latitude  housing_median_age  total_rooms  total_be...
0    -122.23     37.88                41.0        880.0     129.0
1    -122.22     37.86                21.0       7099.0    1106.0
2    -122.24     37.85                52.0       1467.0     190.0
3    -122.25     37.85                52.0       1274.0     235.0
4    -122.25     37.85                52.0       1627.0     280.0


So... what's in the dataset? 


# housing is a Pandas DataFrame
# untouched datafile:


housing.info() 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


# let's see if ocean proximity can be lumped into categories

housing['ocean_proximity'].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

# percentiles analysis of ...

housing.describe()

          longitude      latitude  housing_median_age  total_r...
count  20640.000000  20640.000000        20640.000000  20640.0...
mean    -119.569704     35.631861           28.639486  2635.76...
std        2.003532      2.135952           12.585558  2181.61...
min     -124.350000     32.540000            1.000000  2.00000...
25%     -121.800000     33.930000           18.000000  1447.75...
50%     -118.490000     34.260000           29.000000  2127.00...
75%     -118.010000     37.710000           37.000000  3148.00...
max     -114.310000     41.950000           52.000000  39320.0...

%matplotlib inline
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20,15))

[Figure: 3x3 grid of histograms, one per numeric attribute; visible panel titles include households, housing_median_age and latitude]
Create a test set


# split dataset into training (80%) and test (20%) subsets
import numpy as np


def split_train_test(
        data, test_ratio):

    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)

    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]

    return data.iloc[train_indices], data.iloc[test_indices]


ch02 cal housing analysis.md 


train_set, test_set = split_train_test(housing, 0.2)

print(len(train_set), "train +", len(test_set), "test")

16512 train + 4128 test


# create method for ensuring consistent test sets across multiple runs
# (new test sets won't contain instances in previous training sets.)

# example method:
# compute hash of each instance
# keep only the last byte
# include instance in test set if value < 51 (20% of 256)

import hashlib


def test_set_check(
        identifier, test_ratio, hash):

    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio


def split_train_test_by_id(
        data, test_ratio, id_column, hash=hashlib.md5):

    ids = data[id_column]
    in_test_set = ids.apply(
        lambda id_: test_set_check(
            id_, test_ratio, hash))

    return data.loc[~in_test_set], data.loc[in_test_set]


housing_with_id = housing.reset_index()

train_set, test_set = split_train_test_by_id(
    housing_with_id, 0.2, "index")

train_set.head()

   index  longitude  latitude  housing_median_age  total_rooms
0      0    -122.23     37.88                41.0        880.0
1      1    -122.22     37.86                21.0       7099.0
2      2    -122.24     37.85                52.0       1467.0
3      3    -122.25     37.85                52.0       1274.0
6      6    -122.25     37.84                52.0       2535.0

test_set.head()

     index  longitude  latitude  housing_median_age  total_rooms
4        4    -122.25     37.85                52.0       1627.0
5        5    -122.25     37.85                52.0        919.0
11      11    -122.26     37.85                52.0       3503.0
20      20    -122.27     37.85                40.0        751.0
...    ...    -122.27     37.84                52.0       1688.0

housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]

train_set, test_set = split_train_test_by_id(
    housing_with_id, 0.2, "id")

train_set.head()

   index  longitude  latitude  housing_median_age  total_rooms
0      0    -122.23     37.88                41.0        880.0
1      1    -122.22     37.86                21.0       7099.0
2      2    -122.24     37.85                52.0       1467.0
3      3    -122.25     37.85                52.0       1274.0
4      4    -122.25     37.85                52.0       1627.0

test_set.head()

    index  longitude  latitude  housing_median_age  total_rooms
8       8    -122.26     37.84                42.0       2555.0
10     10    -122.26     37.85                52.0       2202.0
11     11    -122.26     37.85                52.0       3503.0
12     12    -122.26     37.85                52.0       2491.0
13     13    -122.26     37.84                52.0        696.0


from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(
    housing, test_size=0.2, random_state=42)

test_set.head()

       longitude  latitude  housing_median_age  total_rooms  ...
20046    -119.01     36.06                25.0       1505.0  ...
3024     -119.46     35.14                30.0       2943.0  ...
15663    -122.44     37.80                52.0       3830.0  ...
20484    -118.72     34.28                17.0       3051.0  ...
9814     -121.93     36.62                34.0       2351.0  ...

train_set.head()

       longitude  latitude  housing_median_age  total_rooms  ...
14196    -117.03     32.71                33.0       3126.0  ...
8267     -118.16     33.77                49.0       3382.0  ...
17445    -120.48     34.66                 4.0       1897.0  ...
14265    -117.11     32.69                36.0       1421.0  ...
2271     -119.80     36.78                 ...       2382.0  ...

housing['median_income'].hist(bins=5)

<matplotlib.axes._subplots.AxesSubplot at 0x7f15f7250588>


[Figure: histogram of median_income with 5 bins; y-axis ticks up to 10000]








housing.describe()

          longitude      latitude  housing_median_age  total_r...
count  20640.000000  20640.000000        20640.000000  20640.0...
mean    -119.569704     35.631861           28.639486  2635.76...
std        2.003532      2.135952           12.585558  2181.61...
min     -124.350000     32.540000            1.000000  2.00000...
25%     -121.800000     33.930000           18.000000  1447.75...
50%     -118.490000     34.260000           29.000000  2127.00...
75%     -118.010000     37.710000           37.000000  3148.00...
max     -114.310000     41.950000           52.000000  39320.0...

housing["income_cat"] = np.ceil(housing["median_income"]/1.5)

# cap the category at 5 (assign the result back instead of relying on inplace=True)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)


housing.describe()

          longitude      latitude  housing_median_age  total_r...
count  20640.000000  20640.000000        20640.000000  20640.0...
mean    -119.569704     35.631861           28.639486  2635.76...
std        2.003532      2.135952           12.585558  2181.61...
min     -124.350000     32.540000            1.000000  2.00000...
25%     -121.800000     33.930000           18.000000  1447.75...
50%     -118.490000     34.260000           29.000000  2127.00...
75%     -118.010000     37.710000           37.000000  3148.00...
max     -114.310000     41.950000           52.000000  39320.0...

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(
    n_splits=1, test_size=0.2, random_state=12)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# review income category proportions

housing["income_cat"].value_counts() / len(housing)

3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
# remove income_cat attribute (return dataset to original state)
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)
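The idea behind the stratified split above is that each income category contributes the same fraction of its instances to the test set. Here is a minimal standard-library sketch of that idea; it is not how StratifiedShuffleSplit is implemented, and the function name stratified_split and the toy category counts are assumptions chosen to roughly mirror the proportions printed above.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_ratio, seed=0):
    """Split indices so every label keeps (about) the same test fraction."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy income categories (hypothetical counts, 1000 instances in total).
labels = [1] * 40 + [2] * 320 + [3] * 350 + [4] * 175 + [5] * 115
train_idx, test_idx = stratified_split(labels, 0.2)

# Each category appears in the test set in proportion to its overall share.
test_counts = Counter(labels[i] for i in test_idx)
print(sorted(test_counts.items()))
```

A purely random split of the same data would only match these proportions on average, which matters most for small categories such as the first one.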


Visualization 


housing = strat_train_set.copy()
# first: basic geographic distribution

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

<matplotlib.axes._subplots.AxesSubplot at 0x7f15f71c4668>

[Figure: scatter plot of latitude (y) vs. longitude (x), alpha=0.1]
# next: housing prices.
# color = price
# radius = population
# use predefined "jet" color map

housing.plot(
    kind="scatter",
    x="longitude",
    y="latitude",
    alpha=0.4,
    #s=housing["population"].apply(lambda n: n/100),
    s=housing["population"]/100,
    label="population",
    c="median_house_value",
    cmap=plt.get_cmap("jet"),
    colorbar=True,
)
plt.legend()

<matplotlib.legend.Legend at 0x7f15f5eff1d0>
Correlations

# next: look for correlations to median house value.

corr_matrix = housing.corr()

corr_matrix['median_house_value'].sort_values(ascending=False)

median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64
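The numbers in the correlation listing are Pearson correlation coefficients (the default for DataFrame.corr()). As a minimal sketch of what one such coefficient measures, the function below computes it by hand with the standard library; pearson_r and the toy lists are assumptions for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


income = [1.0, 2.0, 3.0, 4.0, 5.0]          # toy values
value_up = [10.0, 20.0, 30.0, 40.0, 50.0]   # rises linearly with income
value_down = [50.0, 40.0, 30.0, 20.0, 10.0] # falls linearly with income

print(pearson_r(income, value_up))
print(pearson_r(income, value_down))
```

The coefficient only captures linear relationships, so a value near zero does not rule out a strong nonlinear dependence between two attributes.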

# another way of looking for correlations: scatter matrix
# focus on top 3 factors from above

from pandas.tools.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]

scatter_matrix(housing[attributes], figsize=(12, 8))


[Output: 4x4 array of matplotlib AxesSubplot objects]

[Figure: scatter matrix of median_house_value, median_income, total_rooms and housing_median_age]
# median house value to median income seems to be the most promising.
# let's zoom in.

housing.plot(
    kind="scatter", x="median_income", y="median_house_value",
    alpha=0.1)

<matplotlib.axes._subplots.AxesSubplot at 0x7f15f41b1a90>

[Figure: scatter plot of median_house_value (y) vs. median_income (x)]


# combine some attributes to create more useful ones
# then rebuild the correlation matrix.

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]

corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)

# *** NOTE: rooms_per_household corr (in book) shows more improvement, ~0.199
# compared to our 0.146. Not sure of root cause yet. ***

median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64


Data Cleanup 


# revert to clean copy of stratified training dataset
# separate predictors from labels

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# 'total_bedrooms' has some missing values - fix
# can use DataFrame dropna(), drop(), fillna()
# use Scikit-Learn class to handle missing values

from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")

# drop ocean_proximity attribute, since it's non-numeric.
# then fit to training data.

housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)

Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)


# now what do we have?
imputer.statistics_

array([ -118.51  ,    34.26  ,    29.    ,  2119.5   ,   433.    ,
        1164.    ,   408.    ,     3.5409])

housing_num.median().values

array([ -118.51  ,    34.26  ,    29.    ,  2119.5   ,   433.    ,
        1164.    ,   408.    ,     3.5409])
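The two arrays above agree because a median imputer simply learns one median per column during fit() and substitutes it for missing entries during transform(). A minimal standard-library sketch of that behaviour follows; fit_medians, fill_missing and the toy rows (with None standing in for NaN) are assumptions for illustration, not Scikit-Learn's implementation.

```python
from statistics import median


def fit_medians(rows):
    """Learn one median per column, ignoring missing (None) entries."""
    columns = zip(*rows)
    return [median(v for v in col if v is not None) for col in columns]


def fill_missing(rows, medians):
    """Replace each missing entry with the learned median of its column."""
    return [[m if v is None else v for v, m in zip(row, medians)]
            for row in rows]


# Toy rows: (total_rooms, total_bedrooms), one bedroom count missing.
rows = [
    [880.0, 129.0],
    [7099.0, None],
    [1467.0, 190.0],
    [1274.0, 235.0],
]

medians = fit_medians(rows)
filled = fill_missing(rows, medians)
print(medians)
print(filled[1])
```

Keeping the learned medians around matters because the same values have to be reused later on the test set and on new data.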


# update training set by replacing missing values with learned medians
X = imputer.transform(housing_num)

pd.DataFrame(X, columns=housing_num.columns).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 8 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16512 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
dtypes: float64(8)
memory usage: 1.0 MB

# convert text feature to numbers using LabelEncoder

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
housing_cat = housing['ocean_proximity']
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded

array([0, 0, 4, ..., 1, 0, 3])

# how is ocean_proximity mapped?

print(encoder.classes_)

['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']


# a better solution for categorical data: one-hot encoding 


from sklearn.preprocessing import OneHotEncoder 
encoder = OneHotEncoder()

# output = SciPy sparse matrix, better for memory usage
# if you need a dense NumPy array, call toarray()

housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
    with 16512 stored elements in Compressed Sparse Row format>


Label Binarization: 


• A shortcut (text categories => integer categories => one-hot vectors)

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot

array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])
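The text-to-one-hot shortcut above can be illustrated without any library: sort the distinct categories, give each a column, and emit a row with a single 1 per instance. The sketch below is a toy illustration; one_hot_encode and the sample list are assumptions, and it returns plain Python lists rather than a NumPy array.

```python
def one_hot_encode(values):
    """Map each category to a vector with a single 1 in its own column."""
    classes = sorted(set(values))
    column = {c: i for i, c in enumerate(classes)}
    vectors = [[1 if column[v] == i else 0 for i in range(len(classes))]
               for v in values]
    return classes, vectors


# Toy sample using the five ocean_proximity categories.
cats = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND", "INLAND"]
classes, vectors = one_hot_encode(cats)

print(classes)
for cat, vec in zip(cats, vectors):
    print(cat, vec)
```

One-hot vectors avoid the artificial ordering that integer codes introduce: with plain integers, a model could treat category 0 and category 1 as more similar than category 0 and category 4.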


Custom Transformers: 


• Create your own using Scikit-Learn classes
• implement fit(), transform() and fit_transform() methods
• (fit_transform() comes for free by using TransformerMixin as a base class.)

from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):

    def __init__(self, add_bedrooms_per_room = True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]

        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X,
                         rooms_per_household,
                         population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X,
                         rooms_per_household,
                         population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)


Feature Scaling 


Min-max scaling (normalization) = shift & rescale to [0,1]

Scikit-Learn's MinMaxScaler will do this for you.

Standardization subtracts the mean & divides by the standard deviation - result has zero mean and unit variance

Scikit-Learn's StandardScaler does this for you.
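Both scaling schemes are short enough to write out by hand. The sketch below uses only the standard library and a handful of toy housing_median_age values; the function names are assumptions for illustration, and the population standard deviation (pstdev) is used, which is also what StandardScaler uses.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Shift and rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [41.0, 21.0, 52.0, 52.0, 52.0]  # toy sample
scaled = min_max_scale(ages)
standardized = standardize(ages)

print(scaled)
print(standardized)
```

Min-max scaling bounds the result but is sensitive to outliers, since one extreme value squeezes everything else toward one end; standardization has no fixed range but is much less affected by a single extreme value.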


Pipelining 


• Scikit-Learn's Pipeline class helps to standardize the sequence of transforms you need for your project.
• Pipelines = list of estimator steps. All but the last must be transformers (they must have a fit_transform() method.)
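To make the chaining idea concrete, here is a toy pipeline written from scratch. It is only a sketch of the concept described in the bullets above: TinyPipeline, FillMissing and Center are hypothetical classes invented for this example and are not part of Scikit-Learn, whose Pipeline class does considerably more.

```python
class FillMissing:
    """Hypothetical transformer: replace None with a fixed value."""

    def __init__(self, fill_value):
        self.fill_value = fill_value

    def fit(self, X):
        return self  # nothing to learn

    def transform(self, X):
        return [self.fill_value if x is None else x for x in X]


class Center:
    """Hypothetical transformer: subtract the mean learned during fit()."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class TinyPipeline:
    """Run each (name, step) in order, feeding one step's output to the next."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X):
        for _name, step in self.steps:
            X = step.fit(X).transform(X)
        return X

    def transform(self, X):
        for _name, step in self.steps:
            X = step.transform(X)
        return X


pipeline = TinyPipeline([
    ("fill", FillMissing(2.0)),
    ("center", Center()),
])

train_out = pipeline.fit_transform([1.0, None, 3.0])
new_out = pipeline.transform([None, 5.0])
print(train_out, new_out)
```

The point of the pattern is that statistics such as the mean are learned once on the training data and then reused unchanged when transform() is called on new data.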


# "DataFrameSelector" is a custom transformer class. 
# grabs the specified features, drops the rest, converts the DF into a NumPy array.

from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])

# let's try it out:

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared


array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

housing_prepared.shape

(16512, 16)


• Note: pip3 install sklearn-pandas => gets a DataFrameMapper class


Model Selection & Training


from sklearn.linear_model import LinearRegression 
lin_reg = LinearRegression() 


lin_reg.fit(housing_prepared, housing_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# First try. NOT very accurate.

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("predictions:\t", lin_reg.predict(some_data_prepared))
print("labels:\t", list(some_labels))

predictions: [ 210644.60459286  317768.80697211  210956.43331178   59218.98886849
  189747.55849879]
labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

# why? look at RMSE on whole training set.

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)

print("typical prediction error:\t", lin_rmse)

typical prediction error: 68628.1981985

# Hmmm. Not good. Underfit situation.
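RMSE is the square root of the mean squared difference between predictions and labels. As a small worked sketch, the function below applies that definition to the five predictions and labels printed above; the helper name rmse is an assumption, and the result covers only these five instances, so it will not equal the full-training-set figure.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The five predictions and labels shown in the output above.
predictions = [210644.60459286, 317768.80697211, 210956.43331178,
               59218.98886849, 189747.55849879]
labels = [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

sample_rmse = rmse(labels, predictions)
print(round(sample_rmse))
```

Because errors are squared before averaging, the two large misses (the first and last instances) dominate the result.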


# Let's try a more powerful model, like a Decision Tree. 


from sklearn.tree import DecisionTreeRegressor 


tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

# Zero error? No way

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)

print("typical prediction error:\t", tree_rmse)

typical prediction error: 0.0


# Use K-fold cross-validation
# Train & eval Decision Tree model against 10 splits of training dataset
# Returns 10 evaluation scores.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    tree_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_mean_squared_error",
    cv=10)

rmse_scores = np.sqrt(-scores)


def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())


display_scores(rmse_scores)

Scores: [ 69368.62190153  66248.56520386  72284.6557095   68417.57732406
  70049.44916939  74941.75765797  70236.59348749  69466.63688954
  76140.22952307  70217.59755116]
Mean: 70737.1684418
Standard deviation: 2815.58298405

# So, Decision Tree RMSE: mean ~71097, stdev 2165 (still sucks.)
# compare to earlier Linear Regression:

lin_scores = cross_val_score(
    lin_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_mean_squared_error",
    cv=10)

lin_rmse_scores = np.sqrt(-lin_scores)

display_scores(lin_rmse_scores)

Scores: [ 66782.73843989  66960.118071    70347.95244419  74739.57052552
  68031.13388938  71193.84183426  64969.63056405  68281.61137997
  71552.91566558  67665.10082067]
Mean: 69052.4613635
Standard deviation: 2731.6740018

# overfit is just about as bad. (RMSE mean 69052, stdev ...)
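K-fold cross-validation, as used for both models above, partitions the training indices into K folds and holds out each fold once for validation while training on the rest. The generator below is a minimal standard-library sketch of the index bookkeeping only (no shuffling, no model); k_fold_indices and the sample size are assumptions for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(25, 10))
for train, validation in folds[:3]:
    print(len(train), validation)
```

Each instance is used for validation exactly once, which is why the ten scores together give a more trustworthy estimate (and a spread) than a single evaluation on the training set.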


from sklearn.ensemble import RandomForestRegressor 


forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

forest_scores = cross_val_score(
    forest_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_mean_squared_error",
    cv=10)

forest_rmse_scores = np.sqrt(-forest_scores)

display_scores(forest_rmse_scores)

Scores: [ 52480.82629458  50035.41358467  53747.69332484  55053.95194112
  51800.65152945  55919.01705209  52226.75176017  50912.82366116
  55708.47271341  51931.81080304]
Mean: 52981.7412665
Standard deviation: 1929.32402243


Fine-Tuning Model with Grid Search of 
Hyperparameters 


from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30],
     'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False],   # bootstrap = True = default setting
     'n_estimators': [3, 10],
     'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(
    forest_reg,
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error')

grid_search.fit(housing_prepared, housing_labels)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'max_features': [2, 4, 6, 8], 'n_estimators': [3, 10, 30]}, {'bootstrap': [False], 'max_features': [2, 3, 4], 'n_estimators': [3, 10]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)


# Best combination of parameters?

grid_search.best_params_

{'max_features': 6, 'n_estimators': 30}

# Best estimator?

grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=6, max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)


# Evaluation scores:

cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"],
                              cvres["params"]):
    print(np.sqrt(-mean_score), params)

63492.9975584 {'max_features': 2, 'n_estimators': 3}
55677.1037862 {'max_features': 2, 'n_estimators': 10}
52917.801725 {'max_features': 2, 'n_estimators': 30}
60442.2787178 {'max_features': 4, 'n_estimators': 3}
53209.7111283 {'max_features': 4, 'n_estimators': 10}
50621.1191846 {'max_features': 4, 'n_estimators': 30}
58591.8196313 {'max_features': 6, 'n_estimators': 3}
52353.3606044 {'max_features': 6, 'n_estimators': 10}
49838.3807 {'max_features': 6, 'n_estimators': 30}
58615.6100561 {'max_features': 8, 'n_estimators': 3}
51726.2593734 {'max_features': 8, 'n_estimators': 10}
50074.3050139 {'max_features': 8, 'n_estimators': 30}
62010.5215854 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54852.7770725 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
60246.2164711 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52752.4109521 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
58355.1846204 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51724.6800894 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
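The eighteen result lines above correspond to the eighteen parameter combinations in the grid: 3 x 4 from the first dictionary plus 1 x 2 x 3 from the second. A short standard-library sketch of how such a grid expands is shown below; the expand helper is an assumption for illustration and is not how GridSearchCV is implemented internally.

```python
from itertools import product

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]


def expand(grid):
    """Yield every parameter combination described by a list of grids."""
    for block in grid:
        keys = sorted(block)
        for values in product(*(block[key] for key in keys)):
            yield dict(zip(keys, values))


combos = list(expand(param_grid))
n_fits = len(combos) * 5  # cv=5 means five trainings per combination

print(len(combos), n_fits)
```

The fit count grows multiplicatively with every hyperparameter added, which is the main practical limit of an exhaustive grid search.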


feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([  8.00229340e-02,   7.13499357e-02,   4.21346911e-02,
         1.73340009e-02,   1.55694906e-02,   1.76527489e-02,
         1.56813711e-02,   3.21068169e-01,   7.54675530e-02,
         1.07645094e-01,   5.74608930e-02,   1.47327045e-02,
         1.57310792e-01,   9.20951468e-05,   2.63317542e-03,
         3.84435167e-03])

# display feature "importance" scores next to their names

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs

sorted(zip(feature_importances, attributes), reverse=True)

[(0.32106816893273865, 'median_income'),
 (0.15731079177984286, 'INLAND'),
 (0.10764509417315272, 'pop_per_hhold'),
 (0.080022934000105003, 'longitude'),
 (0.075467553036607335, 'rooms_per_hhold'),
 (0.071349935674308126, 'latitude'),
 (0.057460893036370447, 'bedrooms_per_room'),
 (0.04213469106714228, 'housing_median_age'),
 (0.017652748894983483, 'population'),
 (0.017334000890698829, 'total_rooms'),
 (0.015681371107232313, 'households'),
 (0.015569490624941605, 'total_bedrooms'),
 (0.014732704544371122, '<1H OCEAN'),
 (0.0038443516681782959, 'NEAR OCEAN'),
 (0.00263317542255579, 'NEAR BAY'),
 (9.2095146771177451e-05, 'ISLAND')]


Time to Eval System on Test dataset 


final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

final_rmse

47574.62166586089


MNIST: the "Hello World" of Machine Learning 


# Alternative local file loader (due to mldata.org being down)

from scipy.io import loadmat
mnist_raw = loadmat("mnist-original.mat")
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}

# 70K images, 28x28 pixels/image, each pixel = 0 (white) to 255 (black)
mnist   # a dict object

{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([ 0.,  0.,  0., ...,  9.,  9.,  9.])}

# take a peek
X, y = mnist['data'], mnist['target']

X.shape, y.shape

((70000, 784), (70000,))


# display example image 


%matplotlib inline 
import matplotlib 
import matplotlib.pyplot as plt 


some_digit = X[36000] 
some_digit_image = some_digit.reshape(28, 28) 


plt.imshow(
    some_digit_image,
    cmap = matplotlib.cm.binary,
    interpolation="nearest")


plt.axis("off") 
plt.show() 





# looks like a "five". What's the corresponding label? 
y[36000] 


# dataset already split into training (1st 60K) & test (last 10K) images.
# shuffle training set for cross-validation quality

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(60000, 784) (10000, 784) (60000,) (10000,)


Binary classifier training - distinguish between 2 
classes 


• Using Stochastic Gradient Descent

# Start by only trying to ID "five" digits.

y_train_5 = (y_train == 5)  # create target vectors
y_test_5 = (y_test == 5)

print(y_train_5.shape, y_train_5)
print(y_test_5.shape, y_test_5)

# SGD classifier: good at handling large DBs
# also good at handling one-at-a-time learning

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# did it correctly predict the "five" found above?

print(sgd_clf.predict([some_digit]))

(60000,) [False False False ..., False False False]
(10000,) [False False False ..., False False False]
[ True]


Performance Measures


# measure accuracy using K-fold (n=3) cross-validation scores


from sklearn.model_selection import cross_val_score


print(cross_val_score(
    sgd_clf,
    X_train,
    y_train_5,
    cv=3,
    scoring="accuracy"))


# 90% accuracy = pretty easy when 90% of digits aren't fives to begin with ... :-|


[ 0.96795 0.96975 0.96855] 




# rolling your own cross-validation. Results should be similar.


from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone


# note: newer scikit-learn releases require shuffle=True whenever random_state is set
skfolds = StratifiedKFold(n_splits=3, random_state=42)


for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)

    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)

    print(n_correct / len(y_pred))


0.96795 
0.96975 
0.96855 


# 95% accuracy sounds too good to be true. How about not-fives?
from sklearn.base import BaseEstimator


class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass

    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)


never_5_clf = Never5Classifier()


print(cross_val_score(
    never_5_clf,
    X_train,
    y_train_5,
    cv=3,
    scoring="accuracy"))


[ 0.9096 0.9124 0.90695] 


# only ~10% of images are "five", so ~90% of images are "not five".
# You SHOULD be right about 90% of the time. :-)


# Lesson Learned:
# Accuracy is not a good metric for classifiers - esp those with skewed datasets.
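
To make the lesson concrete, here is a minimal pure-Python sketch. The label list is made up for illustration (it is not MNIST data); it only assumes that 10% of the labels are positive. A classifier that always answers "not five" then scores 90% accuracy without detecting anything.

```python
# Hypothetical skewed label set: 10% positives ("five"), 90% negatives.
labels = [True] * 100 + [False] * 900

# A "never five" classifier predicts False for every instance.
predictions = [False] * len(labels)

n_correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = n_correct / len(labels)
print(accuracy)  # 0.9, despite never detecting a single five
```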


Confusion Matrix - a better way of evaluating a 
classifier 


# general idea: count #times instances of A are classified as B. 
# first, need a set of predictions. 


from sklearn.model_selection import cross_val_predict
# Generate cross-val'd predictions for each datapoint
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)


# ROWS = actual classes
# COLS = predicted classes


from sklearn.metrics import confusion_matrix


print(confusion_matrix(y_train_5, y_train_pred))







[[54044 535] 
[ 1340 4081]] 


Classifier metrics: precision = TP/(TP+FP); recall 
(sensitivity) = TP/(TP+FN) 


[Figure: confusion matrix diagram (actual vs. predicted, Negative/Positive) marking TN and FP, with a worked precision example]





print(3841 / (3841 + 1515), 3841 / (3841 + 1580))


0.7171396564600448 0.7085408596199964 


# precision, recall, f1 metrics
# precision/recall tradeoff: increasing one reduces the other.


from sklearn.metrics import precision_score, recall_score, f1_score

print("precision:\n", precision_score(y_train_5, y_train_pred))
print("recall:\n", recall_score(y_train_5, y_train_pred))


# F1 score favors classifiers with similar precision & recall.
print("f1:\n", f1_score(y_train_5, y_train_pred))





precision: 
0.884098786828 

recall: 
0.752813134108 

f1: 
0.813191192587 
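
As a cross-check, the three scores above can be recomputed by hand from the confusion matrix printed earlier ([[54044 535] [1340 4081]]). This is only a sketch of the arithmetic behind the formulas in the heading; the variable names are chosen here for illustration.

```python
# Confusion matrix printed earlier: rows = actual classes, cols = predicted classes.
tn, fp = 54044, 535
fn, tp = 1340, 4081

precision = tp / (tp + fp)                           # TP / (TP + FP)
recall = tp / (tp + fn)                              # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(precision, recall, f1)
```

The values agree with the scikit-learn output shown above to the printed precision.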


Precision/Recall Tradeoffs 


# Scikit doesn't let you directly set threshold values (which drive the decision
# function for precision/recall.) But you can use the decision function itself.


y_scores = sgd_clf.decision_function([some_digit])
print(y_scores) 


threshold = 0 
y_some_digit_pred = (y_scores > threshold) 
print(y_some_digit_pred) 


[ 57844.42736708 ] 
[ True] 


# raising the threshold reduces recall... 


threshold = 200000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)





[False] 
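[Figure: various thresholds. Instances are ranked by score, with negative predictions to the left of the threshold and positive predictions to the right. At three increasing thresholds: precision = 6/8 (75%), 4/5 (80%), 3/3 (100%); recall = 6/6 (100%), 4/6 (67%), 3/6 (50%).]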
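
The tradeoff can be reproduced with a few lines of plain Python. The scores and labels below are hypothetical, chosen only so that three thresholds give 6/8, 4/5 and 3/3 precision; they are not output from the SGD classifier.

```python
# Hypothetical (score, is_five) pairs; not real classifier output.
scored = [(9, True), (8, True), (7, True), (6, False), (5, True), (4, True),
          (3, False), (2, True), (1, False), (0, False), (-1, False), (-2, False)]

def precision_recall(threshold):
    # Everything scoring above the threshold is predicted positive.
    predicted_pos = [label for score, label in scored if score > threshold]
    tp = sum(predicted_pos)
    total_pos = sum(label for _, label in scored)
    return tp / len(predicted_pos), tp / total_pos

for threshold in (1.5, 4.5, 6.5):
    print(threshold, precision_recall(threshold))
```

Raising the threshold moves precision from 75% to 100% while recall falls from 100% to 50%.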


# how to find the right threshold? 
# start with getting decision scores instead of predictions. 


y_scores = cross_val_predict(
    sgd_clf,
    X_train,
    y_train_5,
    cv=3,
    method="decision_function")


# use results to build a precision/recall curve


from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)


# plot the result 


def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds,
             precisions[:-1],
             "b--",
             label="Precision")
    plt.plot(thresholds,
             recalls[:-1],
             "g-",
             label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()


# plot precision vs recall to look for knee of the curve


def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])


plt.figure(figsize=(10, 4))
plot_precision_vs_recall(precisions, recalls)
plt.show()


[Figure: precision vs. recall curve; x-axis: Recall, y-axis: Precision]


# curve suggests setting threshold ...


y_train_pred_90 = (y_scores > 50000)
print(y_train_pred_90.shape, y_train_pred_90)


print("precision:\n", precision_score(y_train_5, y_train_pred_90))


print("recall:\n", recall_score(y_train_5, y_train_pred_90))





(60000,) [False False False ..., False False False] 
precision: 

0.924948770492 

recall: 

0.666113263236 


ROC (Receiver Operating Characteristic) curve 




# ROC plots TRUE POSITIVE rate (TPR = recall) vs FALSE POSITIVE rate (FPR = 1 - specificity)


from sklearn.metrics import roc_curve 
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores) 


def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], "k--")
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')


plot_roc_curve(fpr, tpr)
plt.show()


# tradeoff: higher recall (TP) => more false positives produced. 
# dotted line = purely random classifier results. 


[Figure: ROC curve for the SGD classifier; x-axis: False Positive Rate, y-axis: True Positive Rate]
# area under curve (AUC) metric: 
# perfect score = ROC AUC = 1.0 


# random score = ROC AUC = 0.5 


from sklearn.metrics import roc_auc_score 





print(roc_auc_score(y_train_5, y_scores))


0.964880839199 
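
One way to build intuition for the AUC number: it equals the fraction of (positive, negative) pairs in which the positive instance gets the higher score. The sketch below computes it that way in plain Python on made-up scores; `roc_auc` is a helper written for this illustration, not scikit-learn's implementation.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [True, True, False, False]
print(roc_auc(labels, [0.9, 0.8, 0.2, 0.1]))  # every positive outranks every negative
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
```

The two calls give 1.0 and 0.5, matching the "perfect" and "random" reference values noted above.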


# train Random Forest classifier
# compare its ROC curve & AUC to SGD classifier


from sklearn.ensemble import RandomForestClassifier


forest_clf = RandomForestClassifier(random_state=42)


# Random Forest doesn't have decision_function(); use predict_proba() instead.
# returns array (row per instance, column per class)


y_probas_forest = cross_val_predict(
    forest_clf,
    X_train,
    y_train_5,
    cv=3,
    method="predict_proba")


# To plot ROC curve, you need scores - not probabilities.
# use positive class probability as the score.


y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)


# plot ROC curve 

plt.plot(fpr, tpr, "b:", label="SGD") 

plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right") 

plt.show() 








[Figure: ROC curves for SGD (dotted) and Random Forest; x-axis: False Positive Rate, y-axis: True Positive Rate]


# Random Forest curve looks much steeper (better). How's the ROC AUC score?


print(roc_auc_score(y_train_5, y_scores_forest))


# How's the precision & recall?
y_train_pred_forest = cross_val_predict(
    forest_clf,
    X_train,
    y_train_5,
    cv=3)


print(precision_score(y_train_5, y_train_pred_forest))


print(recall_score(y_train_5, y_train_pred_forest))


0.992589481683 
0.985567461185 
0.831396421324 


Multiclass Classification 


# some algorithms (RF, Bayes, ..) can handle multiple classes
# others (SVMs, linear, ...) cannot 


# one-vs-all (OVA) strategy for 0-9 digit classification:
# 10 binary classifiers, one for each digit -- select class with highest score


# one-vs-one (OVO) strategy:
# train classifiers for every PAIR of digits -- N*(N-1)/2 classifiers needed!


# Scikit detects when a binary classifier is used on a multi-class problem,
# auto-selects OVA.
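
To see how the two strategies scale, the classifier counts can be worked out directly. This is a small standard-library sketch; the function names are illustrative only.

```python
from itertools import combinations

def n_ova_classifiers(n_classes):
    return n_classes                          # one binary classifier per class

def n_ovo_classifiers(n_classes):
    return n_classes * (n_classes - 1) // 2   # one per unordered pair of classes

digits = range(10)
pairs = list(combinations(digits, 2))         # (0, 1), (0, 2), ..., (8, 9)
print(n_ova_classifiers(10), n_ovo_classifiers(10), len(pairs))
```

For the ten MNIST digits this gives 10 OVA classifiers versus 45 OVO classifiers, although each OVO classifier only sees the training instances of its two classes.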


sgd_clf.fit(X_train, y_train)
print(sgd_clf.predict([some_digit]))  # can SGD correctly predict the "five"?


[ 5.]


# let's see 10 scores, one per class. 
# highest score corresponds to "five". 


some_digit_scores = sgd_clf.decision_function([some_digit])
print(some_digit_scores)
print(sgd_clf.classes_)


[[-177277.32782496 -561668.18573184 -385895.43788059 -114677.95360751
  -410210.58824666   57844.42736708 -654717.63929413 -200777.6510135
  -772154.70175904 -614737.18986655]]
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]




# to force Scikit to use OVO (in this case) or OVA: use corresponding classifier.


from sklearn.multiclass import OneVsOneClassifier


ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
print("prediction:\n", ovo_clf.predict([some_digit]))


# same thing for Random Forest (RF can directly handle multiple classifications)

forest_clf.fit(X_train, y_train)

print("prediction via Random Forest:\n", forest_clf.predict([some_digit]))

print("probability via Random Forest:\n", forest_clf.predict_proba([some_digit]))


prediction: 

[ 5.] 

prediction via Random Forest: 

[ 5.] 

probability via Random Forest: 

[[ 0.1  0.   0.   0.1  0.   0.8  0.   0.   0.   0. ]]


# let's check these classifiers via CV. SGD first.


print("CV score:\n", cross_val_score(
    sgd_clf,
    X_train,
    y_train,
    cv=3,
    scoring="accuracy"))


CV score: 
[ 0.84843031 0.85419271 0.81062159] 


# scaling the inputs should help improve the scores. 


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))


print("CV score, scaled inputs:\n", cross_val_score(
    sgd_clf,
    X_train_scaled,
    y_train,
    cv=3,
    scoring="accuracy"))


CV score, scaled inputs: 
[ 0.91011798 0.91089554 0.90908636] 


Error Analysis 


# as earlier: a confusion matrix from the SGD classifier


y_train_pred = cross_val_predict(
    sgd_clf,
    X_train_scaled,
    y_train,
    cv=3)


conf_mx = confusion_matrix(y_train, y_train_pred)
print("confusion matrix:\n", conf_mx)


# image equivalent
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()


confusion matrix:
 [[5785    4   24   11   13   45   43    8   37    3]
 [   1 6489   43   24    6   35    8    8  116   12]
 [  57   38 5329   88   79   27   92   60  174   14]
 [  53   41  140 5333    2  234   35   60  142   91]
 [  17   26   36   10 5371    8   48   30   77  219]
 [  69   38   39  185   76 4600  114   28  175   97]
 [  34   24   42    2   41   95 5625    7   48    0]
 [  22   21   64   31   49    9    8 5792   14  255]
 [  54  157   70  148   14  158   58   28 5029  135]
 [  42   37   25   85  155   36    2  193   75 5299]]





# focus on errors.
# 1st: divide each value in confusion matrix by #images in corresponding class
# (compares error rates instead of #errors)


row_sums = conf_mx.sum(axis=1, keepdims=True) 
norm_conf_mx = conf_mx / row_sums 


# fill diagonals with zeroes to keep only the errors, and plot. 
# brighter colors = more misclassifications 


np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()


# rows = actual classes
# cols = predicted classes
# 8s & 9s are a problem.





# more on analyzing individual errors 


# EXTRA 
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    # pad the last row with a blank image so every row has the same width
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=matplotlib.cm.binary, **options)
    plt.axis("off")


cl_a, cl_b = 3, 5


X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]


plt.figure(figsize=(8,8))

plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()


# shows difficulty in seeing difference between threes and fives. 


# We used SGDClassifier, which is sensitive to image shifts/rotations.


MultiLabel Classification


# use case: returning multiple classes for each instance
# (example: multiple people's faces in one picture.) 


# create y_multilabel array with 2 target labels for each digit image:
# first = large digit (7,8,9)?; second = odd (1,3,5,7,9)?


from sklearn.neighbors import KNeighborsClassifier 


y_train_large = (y_train >= 7) 
y_train_odd = (y_train % 2 == 1) 


print("large nums?\n",y_train_large) 
print("odd nums?\n",y_train_odd) 


y_multilabel = np.c_[y_train_large, y_train_odd] 


print("combined (multilabel)?\n", y_multilabel ) 


# KNeighbors classifier supports multilabeling 


knn_clf = KNeighborsClassifier() 
knn_clf.fit(X_train, y_multilabel)


# make example prediction using "some digit" from above
# >= 7 = false (correct); odd digit = true (correct)


print("KNN prediction of some digit: (>=7? odd?)\n", knn_clf.predict([some_digit]))


large nums? 


[False False True ..., False False False] 
odd nums? 
[False False True ..., False False False] 


combined (multilabel)? 
[[False False] 
[False False] 
[ True True] 
[False False] 
[False False] 
[False False]] 
KNN prediction of some digit: (>=7? odd?) 
[[False  True]]


y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)


print(f1_score(
    y_train,
    y_train_knn_pred,
    average="macro"))  # use "weighted" to give more weight to more common labels.


0.968186511757


MultiOutput Classification 


# generalization of multilabel, where each label can have multiple values.
# example: build image noise removal system 


# start by adding noise to MNIST dataset 


import numpy.random as rnd 


noise = rnd.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise

noise = rnd.randint(0, 100, (len(X test), 784)) 
X_test_mod = X_test + noise 


y_train_mod = X_train 
y_test_mod = X_test 


some_index = 5500 


def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=matplotlib.cm.binary,
               interpolation="nearest")
    plt.axis("off")


# train classifier, and clean up the image 
knn_clf.fit(X_train_mod, y_train_mod) 
clean_digit = knn_clf.predict([X_test_mod[some_index] ] ) 


plot_digit(clean_digit)


some_index = 5500


plt.subplot(121); plot_digit(X_test_mod[some_index])
plt.subplot(122); plot_digit(y_test_mod[some_index])
#save_fig("noisy_digit_example_plot")

plt.show()


# left: noisy image; right: cleaned up


Training Models - Intro


Linear Regression


• y = theta0 + (theta1 x1) + (theta2 x2) + ...

• = h(theta)(x)

• = theta^T (dot) x --- theta^T = theta vector, transposed (row instead of col)

• Training a model = finding theta that minimizes error function (ex: MSE)


Normal Equation: finds theta that minimizes cost function
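
For reference, the closed-form solution evaluated by the NumPy code in this section can be written as follows. Here X is the feature matrix with the bias column x0 = 1 added (X_b in the code) and y is the target vector; the hat notation is an assumption of this note.

```latex
\hat{\theta} = \left( X^{T} X \right)^{-1} X^{T} y
```

This corresponds term by term to `np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)`.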


# generate some data 
import numpy as np 
X = 2 * np.random.rand(100, 1) 


y = 4 + 3 * X + np.random.randn(100, 1)


%matplotlib inline 
import matplotlib.pyplot as plt 


plt.scatter(X,y) 
plt.show() 














# find theta.
# 1) use NumPy's matrix inverse function.
# 2) use dot method for matrix multiply.


X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)


# results:
print(theta_best)  # compare to generated data: y = 4 + 3x + noise


[[ 3.58859665]
 [ 3.41876053]]


# make some predictions 


X_new = np.array([[0],[1],[2]])
X_new_b = np.c_[np.ones((3, 1)), X_new]  # add x0 = 1 to each instance


y_predict = X_new_b.dot(theta_best) 
print(y_predict) 


# then plot 


plt.plot(X_new, y_predict, "r-") 
plt.plot(X, y, "b.") 
plt.axis([0, 2, 0, 15])
plt.show() 


[[  3.58859665]
 [  7.00735719]
 [ 10.42611772]]

















# Scikit equivalent
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()


lin_reg.fit(X,y)

print("intercept & coefficient:\n", lin_reg.intercept_, lin_reg.coef_)

print("predictions:\n", lin_reg.predict(X_new))


intercept & coefficient:
 [ 3.58859665] [[ 3.41876053]]
predictions:
 [[  3.58859665]
 [  7.00735719]
 [ 10.42611772]]


Gradient Descent 


# Gradient Descent - Batch 
# (Batch: math includes full training set X.) 


# need to find partial derivative (slope) of the cost function 
# for each model parameter (theta). 


theta_path_bgd = []

eta = 0.1  # learning rate

n_iterations = 1000

m = 100

theta = np.random.randn(2,1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
    theta_path_bgd.append(theta)


print(theta)


[[ 3.58859665]
 [ 3.41876053]]
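
The same update rule can be traced without NumPy. The sketch below is a pure-Python version of batch gradient descent on a tiny noise-free data set; the five x values and the zero initialization are assumptions made here so the result is easy to check against y = 4 + 3x.

```python
# Noise-free toy data on the line y = 4 + 3x (an assumption for this sketch).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [4 + 3 * x for x in xs]
m = len(xs)

eta = 0.1                  # learning rate
theta0, theta1 = 0.0, 0.0  # intercept and slope

for _ in range(1000):
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = 2 / m * sum(errors)                             # d(MSE)/d(theta0)
    grad1 = 2 / m * sum(e * x for e, x in zip(errors, xs))  # d(MSE)/d(theta1)
    theta0 -= eta * grad0
    theta1 -= eta * grad1

print(theta0, theta1)
```

Every step uses all m instances, which is what makes this the "batch" variant; the parameters settle at approximately 4 and 3.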




Gradient Descent - Stochastic


Stochastic: finds gradients based on random instances


adv: better for huge datasets

dis: much more erratic than batch GD

-- good for avoiding local minima

-- bad b/c may not find optimum sol'n


simulated annealing helps. (gradually reduces learning rate)
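
As a quick illustration of how such a schedule shrinks the step size, the sketch below evaluates the schedule t0 / (t + t1) with the same hyperparameter values as the SGD code in this section (t0 = 5, t1 = 50); the sample step numbers are arbitrary.

```python
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    # Learning rate after t update steps: large at first, then decaying.
    return t0 / (t + t1)

for t in (0, 50, 450, 4950):
    print(t, learning_schedule(t))
```

The rate starts at 0.1 and has dropped to 0.001 by step 4950, so late updates only nudge theta.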


theta_path_sgd = []


n_epochs, t0, t1 = 50, 5, 50  # learning schedule hyperparameters


def learning_schedule(t):
    return t0 / (t + t1)


theta = np.random.randn(2,1)  # random initialization


for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
        theta_path_sgd.append(theta)


print(theta)


[[ 3.6036273 ]
 [ 3.44079196]]


# SGD Regression using Scikit: 


from sklearn.linear_model import SGDRegressor


# note: n_iter was replaced by max_iter in newer scikit-learn releases
sgd_reg = SGDRegressor(n_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())


print(sgd_reg.intercept_, sgd_reg.coef_)


[ 3.57214013] [ 3.39609675]


# Gradient Descent - MiniBatch 
# adv: performance boost via GPUs 


theta_path_mgd = []


n_iterations = 50
minibatch_size = 20


import numpy.random as rnd


rnd.seed(42)
theta = rnd.randn(2,1)  # random initialization


t0, t1 = 10, 1000
def learning_schedule(t):
    return t0 / (t + t1)


t = 0

for epoch in range(n_iterations):
    shuffled_indices = rnd.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]

    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
        theta_path_mgd.append(theta)


print(theta)


[[ 3.70412445] 
[ 3.54124923]] 


theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)


plt.figure(figsize=(10,4))

plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")

plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=1, label="Mini-batch")

plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=1, label="Batch")

plt.legend(loc="upper right", fontsize=14)
plt.xlabel(r"$\theta_0$", fontsize=20)

plt.ylabel(r"$\theta_1$ ", fontsize=20, rotation=0)

plt.axis([2.5, 4.5, 2.3, 3.9])

#save_fig("gradient_descent_paths_plot")

plt.show()


Polynomial Regression


# example quadratic equation + noise: y = 0.5*X^2 + X + 2 + noise


m = 100 
X = 6 * np.random.rand(m, 1) - 3 
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1) 


plt.scatter(X,y) 
plt.show() 







# fit using Scikit 


from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression


# caution: PolynomialFeatures converts array of n features
# into array of (n+d)!/(d! n!) features -- combinatorial explosions possible :-)
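
To get a feel for that growth, the count can be evaluated with the standard library. `n_poly_features` is a helper written for this note (not a scikit-learn function); it assumes the (n+d)!/(d! n!) count includes the constant bias term, so one is subtracted when the bias is excluded.

```python
from math import comb

def n_poly_features(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= degree in n_features variables."""
    count = comb(n_features + degree, degree)   # (n+d)! / (d! n!)
    return count if include_bias else count - 1

print(n_poly_features(1, 2, include_bias=False))  # x and x^2
print(n_poly_features(2, 3))                      # two features, degree 3
print(n_poly_features(100, 3))                    # 100 features, degree 3
```

One feature at degree 2 without bias gives the two columns used in this section, while 100 features at degree 3 already yield 176851 columns.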


poly_features = PolynomialFeatures(degree=2, include_bias=False)
print(poly_features)


# X_poly: original feature of X, plus its square.
X_poly = poly_features.fit_transform(X)


#print(X, X_poly)
print(X[0], X_poly[0])


# fit it:

lin_reg = LinearRegression()

lin_reg.fit(X_poly, y)

print(lin_reg.intercept_, lin_reg.coef_)


# result estimate: 0.48x(1)^2 + 0.99x(2) + 2.06
# original: 0.50x(1)^2 + 1.00x(2) + 2.00 + gaussian noise


PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)

[ 2.38942838] [ 2.38942838  5.709368  ]

[ 1.9735233] [[ 0.95038538  0.52577032]]


X_new = np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)


#testme = np.linspace(-3,3,20)
#print(testme, testme.reshape(20,1))


plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=14)
plt.ylabel("$y$", rotation=0, fontsize=14)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
#save_fig("quadratic_predictions_plot")
plt.show()


Learning Curves


# another way to check for underfit & overfit:
# use learning curve plots to see performance vs training set size.


from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


# train model multiple times on various training subsets (of various sizes)


def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))

    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="Training set")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)


lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
plt.axis([0, 80, 0, 3])

#save_fig("underfitting_learning_curves_plot")
plt.show()


# repeat exercise for 10th-degree polynomial
from sklearn.pipeline import Pipeline


# steps are passed as a list of (name, estimator) pairs
polynomial_regression = Pipeline([
        ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
        ("sgd_reg", LinearRegression()),
    ])


plot_learning_curves(polynomial_regression, X, y)
plt.axis([0,80,0,3])
plt.show()


# note: training error rate much lower than on Linear Regression
# note: training/validation gap closes to zero. good fit?


Bias/Variance Tradeoff


• Bias: the part of generalization error due to wrong assumptions.

• Variance: due to model sensitivity to small training variations. (More common
in high-dimensional models.)

• Irreducible error: due to data noise.


• Rule of thumb: increasing model complexity increases variance & reduces
bias (and vice versa.)


Regularization


• Used to reduce overfit by constraining the model (ex: reducing the # of
degrees in a polynomial).


# Ridge -- regularization term added to cost function.
# alpha param -- forces model weights to minimal values. higher alpha = "flatter" function (converge to mean)


• Cost function: $J(\theta) = \mathrm{MSE}(\theta) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$


# build dataset 


import numpy.random as rnd


rnd.seed(42)
m = 20
X = 3 * rnd.rand(m, 1)
y = 1 + 0.5 * X + rnd.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)


# plot it

plt.plot(X, y, "b.")

plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 3, 0, 4])


# apply Ridge regression 
from sklearn.linear_model import Ridge 


ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X,y)
ridge_reg.predict([[0.0],[1.5],[2.0],[3.0]])


array([[ 1.00650911], 
[ 1.55071465], 
[ 1.73211649], 
[ 2.09492018]]) 














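# Ridge using SGD:

sgd_reg = SGDRegressor(penalty="l2")

sgd_reg.fit(X,y.ravel())
# note: this line calls ridge_reg, not sgd_reg, so the output repeats the Ridge
# predictions; use sgd_reg.predict(...) to see the SGD result
ridge_reg.predict([[0.0],[1.5],[2.0],[3.0]])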


array([[ 1.00650911], 
[ 1.55071465], 
[ 1.73211649], 
[ 2.09492018]]) 


# Lasso -- similar to Ridge, also adds regularization term 
# uses L1 norm (instead of 1/2 square of L2 norm, as in Ridge.) 
# -- tends to force least important features to zero. 


from sklearn.linear_model import Lasso 
lasso_reg = Lasso(alpha=0.1) 


lasso_reg.fit(X,y) 
lasso_reg.predict([[0.0],[1.5],[2.0],[3.0]]) 


array([ 1.14537356, 1.53788174, 1.66871781, 1.93038993]) 


# Elastic Net -- middle ground.
# regularization = mix of Ridge & Lasso (mix ratio "r")


from sklearn.linear_model import ElasticNet


elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X,y)
elastic_net.predict([[0.0],[1.5],[2.0],[3.0]])


array([ 1.08639303, 1.54333232, 1.69564542, 2.00027161]) 
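
For comparison with the Ridge cost function given earlier in this section, the Lasso and Elastic Net costs can be written in the same notation, with r the mix ratio. This is the textbook form; the constants scikit-learn uses internally to scale each term may differ.

```latex
J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \left| \theta_i \right|
\qquad \text{(Lasso)}

J(\theta) = \mathrm{MSE}(\theta) + r \, \alpha \sum_{i=1}^{n} \left| \theta_i \right|
          + \frac{1 - r}{2} \, \alpha \sum_{i=1}^{n} \theta_i^{2}
\qquad \text{(Elastic Net)}
```

Setting r = 0 recovers the Ridge cost and r = 1 recovers the Lasso cost.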


# Early Stopping -- stop training when minimum validation error reached
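
As a rough sketch of the idea, the loop below stops once the validation error has failed to improve for a fixed number of epochs. This patience rule is a common variant swapped in for illustration; the code in this section instead trains for all epochs and then picks the epoch with the lowest validation error. The error values are made up.

```python
# Hypothetical per-epoch validation errors (made up for illustration).
val_errors = [2.0, 1.4, 1.1, 0.9, 0.95, 1.0, 1.2]

best_epoch, best_error = None, float("inf")
patience, bad_epochs = 2, 0   # stop after 2 epochs without improvement

for epoch, error in enumerate(val_errors):
    if error < best_error:
        best_epoch, best_error, bad_epochs = epoch, error, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break             # validation error stopped improving

print(best_epoch, best_error, epoch)
```

Training halts at epoch 5 and the model from epoch 3 (error 0.9) would be kept.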


# build dataset 

rnd.seed(42) 

m = 100 

X = 6 * rnd.rand(m, 1) - 3 

y = 2 + X + 0.5 * X**2 + rnd.randn(m, 1)


X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)


from sklearn.preprocessing import StandardScaler 
from sklearn.pipeline import Pipeline 


poly_scaler = Pipeline([
        ("poly_features", PolynomialFeatures(
            degree=90,
            include_bias=False)),
        ("std_scaler", StandardScaler()),
    ])


X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)


# note: n_iter was replaced by max_iter in newer scikit-learn releases
sgd_reg = SGDRegressor(n_iter=1,
                       penalty=None,
                       eta0=0.0005,
                       warm_start=True,
                       learning_rate="constant",
                       random_state=42)


n_epochs = 500 
train_errors, val_errors = [], [] 


for epoch in range(n_epochs):
    # warm_start=True: each fit() call continues from the previous weights
    sgd_reg.fit(X_train_poly_scaled, y_train)
    y_train_predict = sgd_reg.predict(X_train_poly_scaled)
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    train_errors.append(mean_squared_error(y_train_predict, y_train))
    val_errors.append(mean_squared_error(y_val_predict, y_val))


best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])


plt.annotate('Best model',
             xy=(best_epoch, best_val_rmse),
             xytext=(best_epoch, best_val_rmse + 1),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.05),
             fontsize=16,
            )


best_val_rmse -= 0.03  # just to make the graph look better
plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)

plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")

plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")

plt.legend(loc="upper right", fontsize=14)

plt.xlabel("Epoch", fontsize=14)

plt.ylabel("RMSE", fontsize=14)


#save_fig("early_stopping_plot")


Logistic Regression


• commonly used to estimate the probability of an instance belonging to a specified class.
positive if >50% (labeled "1"), otherwise labeled "0".


• logistic is a sigmoid function, outputs 0 < n < 1.


• cost function = average over all training data. It is convex, so gradient
descent will find global minimum.
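
The bullets above can be illustrated with a few lines of standard-library Python. This is only a sketch: `sigmoid` and `predict` are helper names chosen here, and the input scores are arbitrary values standing in for theta^T (dot) x.

```python
import math

def sigmoid(t):
    # Logistic function: maps any real score to a value strictly between 0 and 1.
    return 1 / (1 + math.exp(-t))

def predict(score):
    # Label "1" when the estimated probability reaches 50%, otherwise "0".
    return 1 if sigmoid(score) >= 0.5 else 0

print(sigmoid(0), predict(-2.0), predict(3.0))
```

A score of 0 maps to a probability of exactly 0.5, so the decision boundary sits where the linear score changes sign.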


#from sklearn import datasets
#iris = datasets.load_iris()


import numpy as np
from sklearn import datasets


iris = datasets.load_iris()
print(iris.keys())


X = iris["data"][:, 3:]  # petal width
# np.int was removed from recent NumPy releases; the builtin int behaves the same here
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0


dict_keys(['target_names', 'DESCR', 'data', 'target', 'feature_names'])


# train a LR model 


from sklearn.linear_model import LogisticRegression 
log_reg = LogisticRegression()
log_reg.fit(X,y)


# predict probability of flowers with petal widths = 0-3cm
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
print(y_proba)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]


plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginica")

plt.text(decision_boundary+0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")

plt.xlabel("Petal width (cm)", fontsize=14)
plt.ylabel("Probability", fontsize=14)

plt.legend(loc="center left", fontsize=14)

plt.show()


# what's the prediction for petal width = 1.5 cm and 1.7 cm?


print(log_reg.predict([[1.5], [1.7]]))


[0 1]


# Logistic Regression contour plot
# with multiple decision boundaries (not just 50%) 


from sklearn.linear_model import LogisticRegression


X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(int)


log_reg = LogisticRegression(C=10**10)
log_reg.fit(X, y)


x0, x1 = np.meshgrid(
        np.linspace(2.9, 7, 500).reshape(-1, 1),
        np.linspace(0.8, 2.7, 200).reshape(-1, 1),
    )

# ravel(): return contiguous flattened array 
X_new = np.c_[x0.ravel(), x1.ravel()] 


y_proba = log_reg.predict_proba(X_new) 


plt.figure(figsize=(10, 4)) 
plt.plot(X[y==0, 0], X[y==0, 1], "bs") 
plt.plot(X[y==1, 0], X[y==1, 1], "g^")


zz = y_proba[:, 1].reshape(x0.shape) 
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg) 


left_right = np.array([2.9, 7]) 
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]


plt.clabel(contour, inline=1, fontsize=12)

plt.plot(left_right, boundary, "k--", linewidth=3)

plt.text(3.5, 1.5, "Not Iris-Virginica", fontsize=14, color="b", ha="center")

plt.text(6.5, 2.3, "Iris-Virginica", fontsize=14, color="g", ha="center")

plt.xlabel("Petal length", fontsize=14)

plt.ylabel("Petal width", fontsize=14)

plt.axis([2.9, 7, 0.8, 2.7])

#save_fig("logistic_regression_contour_plot")

plt.show()


Softmax Regression (Multinomial Logistic Regression)


Predicts one class at a time (multiclass, not multioutput). Use only for mutually exclusive classes.


Scoring for K classes: $s_k(x) = \theta_k^T \cdot x$


Softmax function (aka normalized exponential): $\hat{p}_k = \sigma(s(x))_k = \frac{\exp(s_k(x))}{\sum_{j=1}^{K} \exp(s_j(x))}$


Prediction: $\hat{y} = \operatorname{argmax}_k \, \sigma(s(x))_k = \operatorname{argmax}_k \, s_k(x) = \operatorname{argmax}_k \, (\theta_k^T \cdot x)$


Uses cross entropy as the cost function to minimize. (Same as log loss, used for Logistic Regression, when K=2.)


$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$


# use Softmax to classify iris flowers


X = iris["data"][:, (2, 3)]  # petal length, width
y = iris["target"]


# Scikit LR can be switched to Softmax with "multinomial" setting.
# also defaults to L2 regularization (control with C parameter)
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)

softmax_reg.fit(X, y) 

# predict iris 5cm long, 2cm wide: 

softmax_reg.predict([[5, 2]]) 


softmax_reg.predict_proba([[5,2]]) 


array([[ 6.33134078e-07,  5.75276067e-02,  9.42471760e-01]]) 
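
The scoring, softmax and prediction steps of this section can be traced by hand with the standard library. In the sketch below the three class scores are hypothetical values standing in for s_k(x), and the max-shift inside `softmax` is a common numerical-stability trick added here, not something taken from the notes.

```python
import math

def softmax(scores):
    shifted = [s - max(scores) for s in scores]   # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]              # normalized exponentials

scores = [1.0, 2.0, 0.5]                          # hypothetical class scores s_k(x)
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
cross_entropy = -math.log(probs[1])               # loss if the true class is 1

print(probs, predicted_class, cross_entropy)
```

The probabilities sum to 1, and the predicted class is the one with the highest raw score, since softmax preserves the ordering of the scores.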




# softmax contour plot 


x0, x1 = np.meshgrid( 
np.linspace(0, 8, 500).reshape(-1, 1), 
np.linspace(0, 3.5, 200).reshape(-1, 1), 
) 


X_new = np.c_[x0.ravel(), x1.ravel()]


y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)


zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)


plt.figure(figsize=(10, 4)) 

plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris-Virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris-Versicolor") 
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris-Setosa") 


from matplotlib.colors import ListedColormap 
custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])


plt.contourf(x0, x1, zz, cmap=custom_cmap, linewidth=5)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)

plt.axis([0, 7, 0, 3.5])

#save_fig("softmax_regression_contour_plot")

plt.show()


SVM Classification (Linear)


e Well suited for complex, small/medium dataset classification. 


%matplotlib inline 
import matplotlib.pyplot as plt 


from sklearn.svm import SVC 
from sklearn import datasets 


iris = datasets.load_iris() 
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

# keep only the two linearly separable classes
setosa_or_versicolor = (y == 0) | (y == 1)
X = X[setosa_or_versicolor]
y = y[setosa_or_versicolor]

# SVM Classifier model
svm_clf = SVC(kernel="linear", C=float("inf"))
svm_clf.fit(X, y)


SVC(C=inf, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


Bad models 


import numpy as np 


x0 = np.linspace(0, 5.5, 200)
pred_1 = 5*x0 - 20
pred_2 = x0 - 1.8
pred_3 = 0.1 * x0 + 0.5


def plot_svc_decision_boundary(svm_clf, xmin, xmax):
    w = svm_clf.coef_[0]
    b = svm_clf.intercept_[0]

    # At the decision boundary, w0*x0 + w1*x1 + b = 0
    # => x1 = -w0/w1 * x0 - b/w1
    x0 = np.linspace(xmin, xmax, 200)
    decision_boundary = -w[0]/w[1] * x0 - b/w[1]

    margin = 1/w[1]
    gutter_up = decision_boundary + margin
    gutter_down = decision_boundary - margin

    svs = svm_clf.support_vectors_
    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#FFAAAA')
    plt.plot(x0, decision_boundary, "k-", linewidth=2)
    plt.plot(x0, gutter_up, "k--", linewidth=2)
    plt.plot(x0, gutter_down, "k--", linewidth=2)


plt.figure(figsize=(12,2.7))

plt.subplot(121)
plt.plot(x0, pred_1, "g--", linewidth=2)
plt.plot(x0, pred_2, "m-", linewidth=2)
plt.plot(x0, pred_3, "r-", linewidth=2)
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=14)
plt.axis([0, 5.5, 0, 2])

plt.subplot(122)
plot_svc_decision_boundary(svm_clf, 0, 5.5)
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo")
plt.xlabel("Petal length", fontsize=14)
plt.axis([0, 5.5, 0, 2])

plt.show()


# On left:
# dashed line = basically useless decision boundary.
# solid lines = OK for this dataset, but no margins. Probably will not work well on new instances.
# On right: SVM finds widest possible "street" between classes.

[Figure: petal width vs. petal length for Iris-Versicolor and Iris-Setosa; left: three hand-picked boundaries, right: SVM boundary with margins]

# sensitivity to feature scaling:

Xs = np.array([[1, 50], [5, 20], [3, 80], [5, 60]]).astype(np.float64)
ys = np.array([0, 0, 1, 1])
svm_clf = SVC(kernel="linear", C=100)
svm_clf.fit(Xs, ys)

plt.figure(figsize=(12,3.2))
plt.subplot(121)
plt.plot(Xs[:, 0][ys==1], Xs[:, 1][ys==1], "bo")
plt.plot(Xs[:, 0][ys==0], Xs[:, 1][ys==0], "ms")
plot_svc_decision_boundary(svm_clf, 0, 6)
plt.xlabel("$x_0$", fontsize=20)
plt.ylabel("$x_1$  ", fontsize=20, rotation=0)
plt.title("Unscaled", fontsize=16)
plt.axis([0, 6, 0, 90])

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(Xs)
svm_clf.fit(X_scaled, ys)

plt.subplot(122)
plt.plot(X_scaled[:, 0][ys==1], X_scaled[:, 1][ys==1], "bo")
plt.plot(X_scaled[:, 0][ys==0], X_scaled[:, 1][ys==0], "ms")
plot_svc_decision_boundary(svm_clf, -2, 2)
plt.xlabel("$x_0$", fontsize=20)
plt.title("Scaled", fontsize=16)
plt.axis([-2, 2, -2, 2])


# SVMs are sensitive to feature scaling. 
# Plot on right has much more robust feature boundary. 
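To make the scaling step concrete, the sketch below standardizes the same four toy points by hand: each feature is centered on its mean and divided by its population standard deviation, which is what StandardScaler does by default. The helper name `standardize` is an assumption made for this example.

```python
import statistics

Xs = [[1, 50], [5, 20], [3, 80], [5, 60]]  # same toy points as above

def standardize(rows):
    """Center each column on its mean and divide by its population std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # ddof=0, like np.std
    return [[(value - mean) / std
             for value, mean, std in zip(row, means, stds)]
            for row in rows]

X_scaled = standardize(Xs)
for row in X_scaled:
    print(row)
```

After scaling, both features have comparable ranges, so neither one dominates the distance computations that determine the margin.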


[-2, 2, -2, 2]

[Figure: SVM decision boundaries on the "Unscaled" (left) and "Scaled" (right) data]

X_scaled


array([[-1.50755672, -0.11547005], 
[ 0.90453403, -1.5011107 ], 
[-0.30151134, 1.27017059], 
[ 0.90453403, 0.34641016]]) 


"hard" margin classification: 

- all instances need to be "off the street".
- all instances need to be "on the right side of the street".

problem: doable only if data is linearly separable

problem: very sensitive to outliers


X_outliers = np.array([[3.4, 1.3], [3.2, 0.8]])
y_outliers = np.array([0, 0])

Xo1 = np.concatenate([X, X_outliers[:1]], axis=0)
yo1 = np.concatenate([y, y_outliers[:1]], axis=0)
Xo2 = np.concatenate([X, X_outliers[1:]], axis=0)
yo2 = np.concatenate([y, y_outliers[1:]], axis=0)

svm_clf2 = SVC(kernel="linear", C=10**9)  # stands in for float("inf")
svm_clf2.fit(Xo2, yo2)

plt.figure(figsize=(12,2.7))

plt.subplot(121)
plt.plot(Xo1[:, 0][yo1==1], Xo1[:, 1][yo1==1], "bs")
plt.plot(Xo1[:, 0][yo1==0], Xo1[:, 1][yo1==0], "yo")
plt.text(0.3, 1.0, "Impossible!", fontsize=20, color="red")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.annotate("Outlier",
             xy=(X_outliers[0][0], X_outliers[0][1]),
             xytext=(2.5, 1.7),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.1),
             fontsize=16,
            )
plt.axis([0, 5.5, 0, 2])

plt.subplot(122)
plt.plot(Xo2[:, 0][yo2==1], Xo2[:, 1][yo2==1], "bs")
plt.plot(Xo2[:, 0][yo2==0], Xo2[:, 1][yo2==0], "yo")
plot_svc_decision_boundary(svm_clf2, 0, 5.5)
plt.xlabel("Petal length", fontsize=14)
plt.annotate("Outlier",
             xy=(X_outliers[1][0], X_outliers[1][1]),
             xytext=(3.2, 0.08),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.1),
             fontsize=16,
            )
plt.axis([0, 5.5, 0, 2])


# solution to "hard margins" problem:
# control hardness with C hyperparameter

from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

scaler = StandardScaler()
svm_clf1 = LinearSVC(C=100, loss="hinge")
svm_clf2 = LinearSVC(C=1, loss="hinge")

scaled_svm_clf1 = Pipeline([
    ("scaler", scaler),
    ("linear_svc", svm_clf1),
])
scaled_svm_clf2 = Pipeline([
    ("scaler", scaler),
    ("linear_svc", svm_clf2),
])

scaled_svm_clf1.fit(X, y)
scaled_svm_clf2.fit(X, y)

scaled_svm_clf2.predict([[5.5, 1.7]])


array([ 1.]) 




# Convert to unscaled parameters
b1 = svm_clf1.decision_function([-scaler.mean_ / scaler.scale_])
b2 = svm_clf2.decision_function([-scaler.mean_ / scaler.scale_])
w1 = svm_clf1.coef_[0] / scaler.scale_
w2 = svm_clf2.coef_[0] / scaler.scale_
svm_clf1.intercept_ = np.array([b1])
svm_clf2.intercept_ = np.array([b2])
svm_clf1.coef_ = np.array([w1])
svm_clf2.coef_ = np.array([w2])

# Find support vectors (LinearSVC does not do this automatically)
t = y * 2 - 1  # map labels {0, 1} to {-1, +1}
support_vectors_idx1 = (t * (X.dot(w1) + b1) < 1).ravel()
support_vectors_idx2 = (t * (X.dot(w2) + b2) < 1).ravel()
svm_clf1.support_vectors_ = X[support_vectors_idx1]
svm_clf2.support_vectors_ = X[support_vectors_idx2]


plt.figure(figsize=(12,3.2))

plt.subplot(121)
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^", label="Iris-Virginica")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs", label="Iris-Versicolor")
plot_svc_decision_boundary(svm_clf1, 4, 6)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=14)
plt.title("$C = {}$".format(svm_clf1.C), fontsize=16)
plt.axis([4, 6, 0.8, 2.8])

plt.subplot(122)
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
plot_svc_decision_boundary(svm_clf2, 4, 6)
plt.xlabel("Petal length", fontsize=14)
plt.title("$C = {}$".format(svm_clf2.C), fontsize=16)
plt.axis([4, 6, 0.8, 2.8])


[4, 6, 0.8, 2.8]

[Figure: soft-margin decision boundaries; left: C=100, right: C=1]

SVM Classification (Non-Linear)

# some (most?) datasets are not linearly separable. simple example below.

X1D = np.linspace(-4, 4, 9).reshape(-1, 1)
X2D = np.c_[X1D, X1D**2]  # adds 2nd, non-linear dimension.
y = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])

plt.figure(figsize=(10, 4))

plt.subplot(121)
plt.grid(True, which='both')
plt.axhline(y=0, color='k')
plt.plot(X1D[:, 0][y==0], np.zeros(4), "bs")
plt.plot(X1D[:, 0][y==1], np.zeros(5), "g^")
plt.gca().get_yaxis().set_ticks([])
plt.xlabel(r"$x_1$", fontsize=20)
plt.axis([-4.5, 4.5, -0.2, 0.2])

plt.subplot(122)
plt.grid(True, which='both')
plt.axhline(y=0, color='k')
plt.axvline(x=0, color='k')
plt.plot(X2D[:, 0][y==0], X2D[:, 1][y==0], "bs")
plt.plot(X2D[:, 0][y==1], X2D[:, 1][y==1], "g^")
plt.xlabel(r"$x_1$", fontsize=20)
plt.ylabel(r"$x_2$", fontsize=20, rotation=0)
plt.gca().get_yaxis().set_ticks([0, 4, 8, 12, 16])
plt.plot([-4.5, 4.5], [6.5, 6.5], "r--", linewidth=3)
plt.axis([-4.5, 4.5, -1, 17])

plt.subplots_adjust(right=1)

#save_fig("higher_dimensions_plot", tight_layout=False)
plt.show()

# result: adding 2nd dimension (on right) makes dataset linearly separable

































# test on "moons" dataset

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

def plot_dataset(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.show()

# do this in Scikit with a Pipeline. Contents:
# 1) Polynomial Features
# 2) StandardScaler
# 3) LinearSVC

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])

polynomial_svm_clf.fit(X, y)

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])

#save_fig("moons_polynomial_svc_plot")
plt.show()








Solving polynomial-feature problems (aka combinatorial explosion) via the kernel trick


from sklearn.svm import SVC

# train SVM classifier using 3rd-degree polynomial kernel
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])

# train SVM classifier using 10th-degree polynomial kernel (for comparison)
poly100_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=10, coef0=100, C=5))
])

poly_kernel_svm_clf.fit(X, y)
poly100_kernel_svm_clf.fit(X, y)

plt.figure(figsize=(11, 4))

plt.subplot(121)
plot_predictions(poly_kernel_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title(r"$d=3, r=1, C=5$", fontsize=18)

plt.subplot(122)
plot_predictions(poly100_kernel_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title(r"$d=10, r=100, C=5$", fontsize=18)

#save_fig("moons_kernelized_polynomial_svc_plot")
plt.show()


left: 3rd-degree polynomial; right: 10th-degree polynomial.
if overfitting, reduce polynomial degree. if underfitting, bump it up.


"coef0": controls high- vs low-degree polynomial influence. 





[Figure: polynomial-kernel SVM decision boundaries on the moons dataset; left: d=3, r=1, C=5; right: d=10, r=100, C=5]
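The idea behind the kernel trick can be checked numerically on a small case. The sketch below assumes a 2nd-degree polynomial mapping of 2D vectors (the names `phi` and `poly2_kernel` are made up for this example) and shows that the dot product of the mapped vectors equals the squared dot product of the originals, so the mapping never has to be computed explicitly.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit 2nd-degree polynomial mapping of a 2D vector (3 features)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(a, b):
    """Same quantity computed in the original space: (a . b) squared."""
    return dot(a, b) ** 2

a, b = (2.0, 3.0), (4.0, 5.0)
explicit = dot(phi(a), phi(b))   # map first, then take the dot product
kernelized = poly2_kernel(a, b)  # no mapping needed
print(explicit, kernelized)
```

With higher degrees the explicit mapping grows combinatorially, while the kernel stays a single dot product followed by a power.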





Adding Similarity Features

- similarity function: measures how much an instance resembles a specified landmark.

# define similarity function to be Gaussian Radial Basis Function (RBF)
# ranges from 0 (far away) to 1 (at landmark)

def gaussian_rbf(x, landmark, gamma):
    return np.exp(-gamma * np.linalg.norm(x - landmark, axis=1)**2)

gamma = 0.3

x1s = np.linspace(-4.5, 4.5, 200).reshape(-1, 1)
x2s = gaussian_rbf(x1s, -2, gamma)
x3s = gaussian_rbf(x1s, 1, gamma)

XK = np.c_[gaussian_rbf(X1D, -2, gamma), gaussian_rbf(X1D, 1, gamma)]
yk = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])

plt.figure(figsize=(11, 4))

plt.subplot(121)
plt.grid(True, which='both')
plt.axhline(y=0, color='k')
plt.scatter(x=[-2, 1], y=[0, 0], s=150, alpha=0.5, c="red")
plt.plot(X1D[:, 0][yk==0], np.zeros(4), "bs")
plt.plot(X1D[:, 0][yk==1], np.zeros(5), "g^")
plt.plot(x1s, x2s, "g--")
plt.plot(x1s, x3s, "b:")
plt.gca().get_yaxis().set_ticks([0, 0.25, 0.5, 0.75, 1])
plt.xlabel(r"$x_1$", fontsize=20)
plt.ylabel(r"Similarity", fontsize=14)
plt.annotate(r'$\mathbf{x}$',
             xy=(X1D[3, 0], 0),
             xytext=(-0.5, 0.20),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.1),
             fontsize=18,
            )
plt.text(-2, 0.9, "$x_2$", ha="center", fontsize=20)
plt.text(1, 0.9, "$x_3$", ha="center", fontsize=20)
plt.axis([-4.5, 4.5, -0.1, 1.1])

plt.subplot(122)
plt.grid(True, which='both')
plt.axhline(y=0, color='k')
plt.axvline(x=0, color='k')
plt.plot(XK[:, 0][yk==0], XK[:, 1][yk==0], "bs")
plt.plot(XK[:, 0][yk==1], XK[:, 1][yk==1], "g^")
plt.xlabel(r"$x_2$", fontsize=20)
plt.ylabel(r"$x_3$  ", fontsize=20, rotation=0)
plt.annotate(r'$\phi\left(\mathbf{x}\right)$',
             xy=(XK[3, 0], XK[3, 1]),
             xytext=(0.65, 0.50),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.1),
             fontsize=18,
            )
plt.plot([-0.1, 1.1], [0.57, -0.1], "r--", linewidth=3)
plt.axis([-0.1, 1.1, -0.1, 1.1])

plt.subplots_adjust(right=1)


#save_fig("kernel_method_plot")

x1_example = X1D[3, 0]
for landmark in (-2, 1):
    k = gaussian_rbf(np.array([[x1_example]]), np.array([[landmark]]), gamma)
    print("Phi({}, {}) = {}".format(x1_example, landmark, k))


Phi(-1.0, -2) = [ 0.74081822]
Phi(-1.0, 1) = [ 0.30119421]


Using a Gaussian RBF Kernel 


rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)

from sklearn.svm import SVC

gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000

hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C))
    ])
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)

plt.figure(figsize=(11, 7))

for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=16)

#save_fig("moons_rbf_svc_plot")
plt.show()


# below: model trained with different values of gamma and C.
# GAMMA:
# bigger gamma = narrower bell curve, so each instance's area of influence = smaller.
# smaller gamma: bigger bell curve = smoother decision boundary.

[Figure: RBF-kernel SVM decision boundaries for gamma in (0.1, 5) and C in (0.001, 1000)]











Computational Complexity 


- LinearSVC class: based on liblinear library. Doesn't support kernel trick. Scales almost linearly with #instances and #features; training complexity ~O(m × n).

- SVC class: based on libsvm library. Does support kernel trick. Training complexity is O(m² × n) to O(m³ × n) = MUCH slower on larger training datasets.


SVM Regression (Linear & Non-Linear) 


- Objectives (the reverse of classification): 1) fit max #instances on the street; 2) keep #margin violations (instances "off the street") to a minimum.

- Width controlled by epsilon hyperparameter.

- Below: random linear dataset. two training results with different vals of epsilon.
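The role of epsilon can be illustrated with the epsilon-insensitive loss: predictions that fall inside the street cost nothing, and only the part of the error beyond epsilon is penalized. The sketch below is a plain-Python illustration with made-up targets and predictions; the function names are assumptions, not part of Scikit-Learn.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the street, linear in the error outside of it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def off_street_indices(y_true, y_pred, epsilon):
    """Indices of instances lying on or outside the street's edge."""
    return [i for i, (yt, yp) in enumerate(zip(y_true, y_pred))
            if abs(yt - yp) >= epsilon]

# hypothetical targets and predictions
y_true = [5.0, 5.0, 7.0]
y_pred = [4.0, 2.0, 7.2]

losses = [epsilon_insensitive_loss(yt, yp, epsilon=1.5)
          for yt, yp in zip(y_true, y_pred)]
violations = off_street_indices(y_true, y_pred, epsilon=1.5)
print(losses, violations)
```

A wider street (larger epsilon) tolerates more error for free, so fewer instances end up influencing the fitted model.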


from sklearn.svm import LinearSVR
import numpy.random as rnd

rnd.seed(42)
m = 50
X = 2 * rnd.rand(m, 1)
y = (4 + 3 * X + rnd.randn(m, 1)).ravel()

svm_reg1 = LinearSVR(epsilon=1.5)
svm_reg2 = LinearSVR(epsilon=0.5)
svm_reg1.fit(X, y)
svm_reg2.fit(X, y)

def find_support_vectors(svm_reg, X, y):
    y_pred = svm_reg.predict(X)
    off_margin = (np.abs(y - y_pred) >= svm_reg.epsilon)
    return np.argwhere(off_margin)

svm_reg1.support_ = find_support_vectors(svm_reg1, X, y)
svm_reg2.support_ = find_support_vectors(svm_reg2, X, y)

eps_x1 = 1
eps_y_pred = svm_reg1.predict([[eps_x1]])


def plot_svm_regression(svm_reg, X, y, axes):
    x1s = np.linspace(axes[0], axes[1], 100).reshape(100, 1)
    y_pred = svm_reg.predict(x1s)
    plt.plot(x1s, y_pred, "k-", linewidth=2, label=r"$\hat{y}$")
    plt.plot(x1s, y_pred + svm_reg.epsilon, "k--")
    plt.plot(x1s, y_pred - svm_reg.epsilon, "k--")
    plt.scatter(X[svm_reg.support_], y[svm_reg.support_], s=180, facecolors='#FFAAAA')
    plt.plot(X, y, "bo")
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.legend(loc="upper left", fontsize=18)
    plt.axis(axes)

plt.figure(figsize=(9, 4))
plt.subplot(121)
plot_svm_regression(svm_reg1, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg1.epsilon), fontsize=18)
plt.ylabel(r"$y$", fontsize=18, rotation=0)
#plt.plot([eps_x1, eps_x1], [eps_y_pred, eps_y_pred - svm_reg1.epsilon], "k-", linewidth=2)
plt.annotate(
        '', xy=(eps_x1, eps_y_pred), xycoords='data',
        xytext=(eps_x1, eps_y_pred - svm_reg1.epsilon),
        textcoords='data', arrowprops={'arrowstyle': '<->', 'linewidth': 1.5}
    )
plt.text(0.91, 5.6, r"$\epsilon$", fontsize=20)
plt.subplot(122)
plot_svm_regression(svm_reg2, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg2.epsilon), fontsize=18)
#save_fig("svm_regression_plot")
plt.show()


[Figure: linear SVM regression; left: epsilon = 1.5, right: epsilon = 0.5]

# Use kernel-ized SVM model to handle nonlinear regression jobs.

from sklearn.svm import SVR

# random quadratic training set.
rnd.seed(42)
m = 100
X = 2 * rnd.rand(m, 1) - 1
y = (0.2 + 0.1 * X + 0.5 * X**2 + rnd.randn(m, 1)/10).ravel()

svm_poly_reg1 = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg2 = SVR(kernel="poly", degree=2, C=0.01, epsilon=0.1)

svm_poly_reg1.fit(X, y)
svm_poly_reg2.fit(X, y)


SVR(C=0.01, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)


plt.figure(figsize=(9, 4))
plt.subplot(121)
plot_svm_regression(svm_poly_reg1, X, y, [-1, 1, 0, 1])
plt.title(r"$degree={}, C={}, \epsilon = {}$".format(svm_poly_reg1.degree, svm_poly_reg1.C, svm_poly_reg1.epsilon), fontsize=18)
plt.ylabel(r"$y$", fontsize=18, rotation=0)
plt.subplot(122)
plot_svm_regression(svm_poly_reg2, X, y, [-1, 1, 0, 1])
plt.title(r"$degree={}, C={}, \epsilon = {}$".format(svm_poly_reg2.degree, svm_poly_reg2.C, svm_poly_reg2.epsilon), fontsize=18)
#save_fig("svm_with_polynomial_kernel_plot")
plt.show()

# left: little regularization (large C)
# right: much more regularization (little C)

[Figure: polynomial-kernel SVM regression; left: degree=2, C=100, epsilon=0.1; right: degree=2, C=0.01, epsilon=0.1]





























Under the Hood 


- conventions: b = bias term; w = feature weights vector.

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

from mpl_toolkits.mplot3d import Axes3D

def plot_3D_decision_function(ax, w, b, x1_lim=[4, 6], x2_lim=[0.8, 2.8]):
    x1_in_bounds = (X[:, 0] > x1_lim[0]) & (X[:, 0] < x1_lim[1])
    X_crop = X[x1_in_bounds]
    y_crop = y[x1_in_bounds]
    x1s = np.linspace(x1_lim[0], x1_lim[1], 20)
    x2s = np.linspace(x2_lim[0], x2_lim[1], 20)
    x1, x2 = np.meshgrid(x1s, x2s)
    xs = np.c_[x1.ravel(), x2.ravel()]
    df = (xs.dot(w) + b).reshape(x1.shape)
    m = 1 / np.linalg.norm(w)
    boundary_x2s = -x1s*(w[0]/w[1])-b/w[1]
    margin_x2s_1 = -x1s*(w[0]/w[1])-(b-1)/w[1]
    margin_x2s_2 = -x1s*(w[0]/w[1])-(b+1)/w[1]
    # flat surface at h = 0
    ax.plot_surface(x1, x2, np.zeros_like(x1), color="b", alpha=0.2, cstride=100, rstride=100)
    ax.plot(x1s, boundary_x2s, 0, "k-", linewidth=2, label=r"$h=0$")
    ax.plot(x1s, margin_x2s_1, 0, "k--", linewidth=2, label=r"$h=\pm 1$")
    ax.plot(x1s, margin_x2s_2, 0, "k--", linewidth=2)
    ax.plot(X_crop[:, 0][y_crop==1], X_crop[:, 1][y_crop==1], 0, "g^")
    ax.plot_wireframe(x1, x2, df, alpha=0.3, color="k")
    ax.plot(X_crop[:, 0][y_crop==0], X_crop[:, 1][y_crop==0], 0, "bs")
    ax.axis(x1_lim + x2_lim)
    ax.text(4.5, 2.5, 3.8, "Decision function $h$", fontsize=15)
    ax.set_xlabel(r"Petal length", fontsize=15)
    ax.set_ylabel(r"Petal width", fontsize=15)
    ax.set_zlabel(r"$h = \mathbf{w}^t \cdot \mathbf{x} + b$", fontsize=18)
    ax.legend(loc="upper left", fontsize=16)

fig = plt.figure(figsize=(11, 6))
ax1 = fig.add_subplot(111, projection='3d')
plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])

#save_fig("iris_3D_plot")
plt.show()








Training Objectives 


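- Slope of the decision function equals the norm of the weight vector (||w||).
- Divide slope by 2 ==> any points where decision function = +1/-1 will be 2x farther away from decision boundary, i.e. the margin becomes twice as wide.

[Figure: decision function and margins for w=1 (left) and w=0.5 (right)]

- So we want minimal ||w|| to get max margins.

- If we also want zero margin violations (hard margin), then decision function needs to be >= 1 for positive instances and <= -1 for negative instances.

- if soft margins OK - need to define a slack variable for each instance (how far it may violate the margin); the C hyperparameter sets the tradeoff between a wide street and few violations.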
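The objectives above can be put into numbers with a small sketch. It assumes the standard hinge-loss form of the soft-margin objective, 0.5 * ||w||^2 + C * sum(max(0, 1 - t * (w.x + b))), with labels t in {-1, +1}; the helper names and the toy points are made up for this example.

```python
import math

def decision_function(w, b, x):
    """h(x) = w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def street_width(w):
    """Distance between the two lines where h = +1 and h = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def soft_margin_cost(w, b, X, t, C):
    """0.5 * ||w||^2 plus C times the total hinge loss (margin violations)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = sum(max(0.0, 1.0 - ti * decision_function(w, b, xi))
                      for xi, ti in zip(X, t))
    return margin_term + C * hinge_total

# halving the weight norm doubles the street width
print(street_width([1.0, 0.0]), street_width([0.5, 0.0]))

# two points safely off the street: only the margin term contributes
print(soft_margin_cost([1.0, 0.0], 0.0, [[2.0, 0.0], [-2.0, 0.0]], [1, -1], C=1.0))
```

Each hinge term plays the role of that instance's slack: it is zero when the instance is on the correct side and off the street, and grows linearly with the size of the violation.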


Quadratic programming 


- Hard- & soft-margin problems = convex quadratic optimization problems with linear constraints, i.e. quadratic programming (QP) problems. See Convex Optimization for more info.

todo: The dual problem
todo: Kernelized SVM
todo: Online (incremental learning) SVMs

- Linear SVM classifiers can also be trained with SGD to find a min-cost solution (useful for online learning). SGD converges much more slowly than QP-based methods.

- implementation:


Intro 


import os 
import numpy.random as rnd 


Training & Visualization 


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

# graph it into a .dot file
from sklearn.tree import export_graphviz

def image_path(fig_id):
    #return os.path...
    return fig_id

export_graphviz(
        tree_clf,
        out_file=image_path("iris_tree.dot"),
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

# convert to PDF or PNG using command-line tool.
! dot -Tpng iris_tree.dot -o iris_tree.png


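[Figure: iris_tree.png, nodes recoverable from the rendered tree:]

Root node (depth 0): petal length (cm) <= 2.45
  gini = 0.6667, samples = 150, value = [50, 50, 50], class = setosa

Right child (depth 1): petal width (cm) <= 1.75
  gini = 0.5, samples = 100, value = [0, 50, 50], class = versicolor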


Predictions 


- DTs require very little data prep. No feature scaling & centering.

- SciKit uses CART algorithm (only two children per node). Other algos, e.g. ID3, can build DTs with >2 children per node.

- gini attribute refers to a node's "impurity" (gini=0 if all applicable training instances belong to same class.)


Plot DT decision boundaries

Depth=0: root node (petal length=2.45cm)
Depth=1: right node splits @ 1.75cm
Stops at max_depth = 2.

Vertical dotted line shows boundary if max_depth set = 3.


import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib.colors import ListedColormap 


def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap, linewidth=10)
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)

plt.figure(figsize=(8, 4))
plot_decision_boundary(tree_clf, X, y)
plt.plot([2.45, 2.45], [0, 3], "k-", linewidth=2)
plt.plot([2.45, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.plot([4.95, 4.95], [0, 1.75], "k:", linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3], "k:", linewidth=2)
plt.text(1.40, 1.0, "Depth=0", fontsize=15)
plt.text(3.2, 1.80, "Depth=1", fontsize=13)
plt.text(4.05, 0.5, "(Depth=2)", fontsize=11)

#save_fig("decision_tree_decision_boundaries_plot")
plt.show()

Estimating Class Probabilities

# Probability of instance 5cm long, 1.5cm wide belonging to each of the three classes:
print(tree_clf.predict_proba([[5, 1.5]]))

# Return class of highest probability (in this case, class #1)
print(tree_clf.predict([[5, 1.5]]))


[[ 0.          0.90740741  0.09259259]]
[1]
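The probabilities printed above are simply the class ratios of the training instances in the leaf that the instance lands in. The sketch below reproduces them from leaf counts of [0, 49, 5]; those counts are an assumption chosen to be consistent with the printed output, since the rendered tree for that leaf is not reproduced in these notes.

```python
def leaf_class_probabilities(class_counts):
    """Ratio of training instances of each class in a leaf node."""
    total = sum(class_counts)
    return [count / total for count in class_counts]

leaf_counts = [0, 49, 5]  # assumed counts: setosa, versicolor, virginica
probas = leaf_class_probabilities(leaf_counts)
predicted_class = probas.index(max(probas))
print(probas, predicted_class)
```

Every instance that reaches the same leaf gets the same estimated probabilities, no matter where it sits inside that leaf's region.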


Training: CART algorithm 


- Split training set in two using feature k and threshold t_k.
- Searches for pair (k, t_k) that returns purest subsets, weighted by size.
- Cost function to minimize shown below.

$J(k, t_k) = \dfrac{m_{\text{left}}}{m} G_{\text{left}} + \dfrac{m_{\text{right}}}{m} G_{\text{right}}$

where $G_{\text{left/right}}$ measures the impurity of the left/right subset, and $m_{\text{left/right}}$ is the number of instances in the left/right subset.

- "Greedy" algorithm; searches for optimum at each level w/o regard for lower levels. Not guaranteed to find optimum solution.
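As a worked example of the cost function above, the sketch below computes the Gini impurity of a node from its class counts and the weighted cost of the root split of the iris tree (petal length <= 2.45), using the counts shown in the tree earlier. The function names are made up for this example.

```python
def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class ratios."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def cart_cost(left_counts, right_counts):
    """Impurity of both subsets, each weighted by its share of instances."""
    m_left, m_right = sum(left_counts), sum(right_counts)
    m = m_left + m_right
    return m_left / m * gini(left_counts) + m_right / m * gini(right_counts)

root_gini = gini([50, 50, 50])                       # impurity before splitting
root_split_cost = cart_cost([50, 0, 0], [0, 50, 50])  # cost of the root split
print(root_gini, root_split_cost)
```

CART evaluates this cost for every candidate (feature, threshold) pair and keeps the pair with the lowest value, then repeats the search inside each subset.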


Computational Complexity 


- Prediction: typically O(log2(m)), independent of #features. (So: very fast prediction times.)


Gini Impurity, or Entropy? 


- Can use entropy measure by setting criterion parameter to "entropy".

$H_i = -\sum_{k=1,\; p_{i,k} \neq 0}^{n} p_{i,k} \log(p_{i,k})$

- Dataset's entropy = 0 when it contains instances of only one class.

- Can use either; Gini impurity = slightly faster. Entropy tends to build slightly more balanced trees.
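The entropy measure can be sketched the same way as the Gini impurity, from a node's class counts. The base-2 logarithm used below is an assumption (the formula above does not state the base); empty classes are skipped, matching the condition that only non-zero ratios are summed.

```python
import math

def entropy(class_counts):
    """Entropy of a node from its class counts, skipping empty classes."""
    total = sum(class_counts)
    return -sum((count / total) * math.log2(count / total)
                for count in class_counts if count > 0)

pure_node = entropy([10, 0])     # only one class present
mixed_node = entropy([5, 5])     # two classes, evenly mixed
print(pure_node, mixed_node)
```

A pure node scores zero and an evenly mixed two-class node scores one bit, so lower values mean purer nodes, just as with Gini impurity.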


Regularization Hyperparameters 


- max_depth controls max depth of the DT. Reducing max_depth regularizes the model, and therefore reduces the risk of overfitting.

- Also: min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, max_features -- increasing min_* or reducing max_* params will regularize the model.





from sklearn.datasets import make_moons
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(deep_tree_clf1, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("No restrictions", fontsize=16)
plt.subplot(122)
plot_decision_boundary(deep_tree_clf2, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("min_samples_leaf = {}".format(deep_tree_clf2.min_samples_leaf), fontsize=14)

#save_fig("min_samples_leaf_plot")
plt.show()

[Figure: decision boundaries on the moons dataset; left: "No restrictions", right: "min_samples_leaf = 4"]





























Regression 


- Task: predict a value (instead of a class) at each leaf node.

from sklearn.tree import DecisionTreeRegressor

# Quadratic training set + noise
rnd.seed(42)
m = 200
X = rnd.rand(m, 1)
y = 4 * (X - 0.5) ** 2
y = y + rnd.randn(m, 1) / 10

tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2)
tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3)
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

def plot_regression_predictions(tree_reg, X, y, axes=[0, 1, -0.2, 1], ylabel="$y$"):
    x1 = np.linspace(axes[0], axes[1], 500).reshape(-1, 1)
    y_pred = tree_reg.predict(x1)
    plt.axis(axes)
    plt.xlabel("$x_1$", fontsize=18)
    if ylabel:
        plt.ylabel(ylabel, fontsize=18, rotation=0)
    plt.plot(X, y, "b.")
    plt.plot(x1, y_pred, "r.-", linewidth=2, label=r"$\hat{y}$")

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_regression_predictions(tree_reg1, X, y)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
plt.text(0.21, 0.65, "Depth=0", fontsize=15)
plt.text(0.01, 0.2, "Depth=1", fontsize=13)
plt.text(0.65, 0.8, "Depth=1", fontsize=13)
plt.legend(loc="upper center", fontsize=18)
plt.title("max_depth=2", fontsize=14)

plt.subplot(122)
plot_regression_predictions(tree_reg2, X, y, ylabel=None)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
for split in (0.0458, 0.1298, 0.2873, 0.9040):
    plt.plot([split, split], [-0.2, 1], "k:", linewidth=1)
plt.text(0.3, 0.5, "Depth=2", fontsize=13)
plt.title("max_depth=3", fontsize=14)

plt.show()

# Predicted value for each region (red line) = avg target value of instances in that region.


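[Figure: regression tree predictions; left: max_depth=2, right: max_depth=3]

- Instead of trying to minimize impurity (classification), DTs now try to minimize the MSE:

$J(k, t_k) = \dfrac{m_{\text{left}}}{m} \mathrm{MSE}_{\text{left}} + \dfrac{m_{\text{right}}}{m} \mathrm{MSE}_{\text{right}}$

where $\mathrm{MSE}_{\text{node}}$ is the mean squared error of the node's training instances around the node's prediction $\hat{y}_{\text{node}}$ (the average target value of those instances).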
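To illustrate the regression cost, here is a minimal sketch of a greedy split search over a single feature: every midpoint between consecutive sorted values is tried as a threshold, and the one with the lowest weighted MSE wins. The function names and the toy data are made up for this example; Scikit-Learn's actual implementation is more elaborate.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, cost) minimizing the size-weighted MSE of both sides."""
    pairs = sorted(zip(xs, ys))
    m = len(pairs)
    best_threshold, best_cost = None, float("inf")
    for i in range(1, m):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = len(left) / m * mse(left) + len(right) / m * mse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

threshold, cost = best_split([0, 1, 2, 3], [0.0, 0.0, 1.0, 1.0])
print(threshold, cost)
```

On this toy step function the search places the threshold exactly between the two plateaus, where both subsets have zero error.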


tree_reg1 = DecisionTreeRegressor(random_state=42)
tree_reg2 = DecisionTreeRegressor(random_state=42, min_samples_leaf=10)
tree_reg1.fit(X, y)
tree_reg2.fit(X, y)

x1 = np.linspace(0, 1, 500).reshape(-1, 1)
y_pred1 = tree_reg1.predict(x1)
y_pred2 = tree_reg2.predict(x1)

plt.figure(figsize=(11, 4))

plt.subplot(121)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred1, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", fontsize=18, rotation=0)
plt.legend(loc="upper center", fontsize=18)
plt.title("No restrictions", fontsize=14)

plt.subplot(122)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred2, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.title("min_samples_leaf={}".format(tree_reg2.min_samples_leaf), fontsize=14)

#save_fig("tree_regression_regularization_plot")
plt.show()

# left: no regularization (default params): overfitting
# right: more reasonable.

[Figure: regression tree predictions; left: "No restrictions", right: "min_samples_leaf=10"]























Instability 


e DTs strongly favor orthogonal decision boundaries. They are sensitive to 
training set rotations. 
e More generally: DTs are sensitive to training data variations. 


rnd.seed(6) 
Xs = rnd.rand(100, 2) - 0.5 
ys = (Xs[:, 0] > 0).astype(np.float32) * 2 


angle = np.pi / 4 

rotation_matrix = np.array( 
[[np.cos(angle), -np.sin(angle)], 
[np.sin(angle), np.cos(angle)]]) 


Xsr = Xs.dot(rotation_matrix) 


tree_clf_s = DecisionTreeClassifier(random_state=42)
tree_clf_s.fit(Xs, ys)

tree_clf_sr = DecisionTreeClassifier(random_state=42)
tree_clf_sr.fit(Xsr, ys)


plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(tree_clf_s, Xs, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)
plt.subplot(122)
plot_decision_boundary(tree_clf_sr, Xsr, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)


#save_fig("sensitivity_to_rotation_plot")
plt.show()


# left: standard linearly separable dataset
# right: dataset rotated by 45 degrees.
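
A rough way to see the rotation sensitivity without plotting is to compare the size of the two trees. The sketch below rebuilds the same kind of toy data and counts leaves; the names tree_s and tree_sr are only for this illustration, and get_n_leaves() assumes a reasonably recent scikit-learn release.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the class depends only on the sign of the first feature.
rng = np.random.RandomState(6)
Xs = rng.rand(100, 2) - 0.5
ys = (Xs[:, 0] > 0).astype(int)

# Rotate every point by 45 degrees.
angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)],
                            [np.sin(angle),  np.cos(angle)]])
Xsr = Xs.dot(rotation_matrix)

tree_s = DecisionTreeClassifier(random_state=42).fit(Xs, ys)
tree_sr = DecisionTreeClassifier(random_state=42).fit(Xsr, ys)

# One axis-aligned split is enough before rotation; after rotation the
# diagonal boundary has to be approximated by a staircase of splits.
print("leaves (original):", tree_s.get_n_leaves())
print("leaves (rotated): ", tree_sr.get_n_leaves())
```

The rotated dataset is just as easy to separate with a straight line, but the tree needs more leaves to fit it.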


Intro


import numpy as np 
import numpy.random as rnd 
import matplotlib.pyplot as plt 


Voting Classifiers 


• Good classifiers can be built by aggregating the predictions of various weaker classifiers and returning the class that gets the most votes. (A "hard voting" classifier.)


heads_proba = 0.51
coin_tosses = (rnd.rand(10000, 10) < heads_proba).astype(np.int32)
cumulative_heads_ratio = np.cumsum(
    coin_tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)
#cumulative_heads_ratio


plt.figure(figsize=(8,3.5))
plt.plot(cumulative_heads_ratio)
plt.plot([0, 10000], [0.51, 0.51], "k--", linewidth=2, label="51%")
plt.plot([0, 10000], [0.5, 0.5], "k-", label="50%")
plt.xlabel("Number of coin tosses")
plt.ylabel("Heads ratio")
plt.legend(loc="lower right")
plt.title("The law of large numbers:")
plt.axis([0, 10000, 0.42, 0.58])
#save_fig("law_of_large_numbers_plot")
plt.show()


[Figure: "The law of large numbers"; x-axis: number of coin tosses, y-axis: heads ratio]


# build a voting classifier in Scikit-Learn using three weaker classifiers


from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons


# use moons dataset
X, y = make_moons(
    n_samples=500,
    noise=0.30,
    random_state=42)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)


from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)


# voting classifier = logistic + random forest + SVC
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft'
    )
voting_clf.fit(X_train, y_train)


from sklearn.metrics import accuracy_score


for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))





LogisticRegression 0.864 
RandomForestClassifier 0.872 
SVC 0.888 

VotingClassifier 0.912 


• Soft voting: if all classifiers can estimate class probabilities (i.e. they have a predict_proba() method), Scikit-Learn can predict the class with the highest probability, averaged over all individual classifiers.


• Often better than hard voting because it gives more weight to highly confident votes. To use it, set voting="soft" instead of voting="hard" and ensure all classifiers can estimate class probabilities. (SVC cannot by default; set its probability param to True.)


• This tells SVC to use cross-validation to estimate class probabilities. It slows down training but adds a predict_proba() method.
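
To make the difference between hard and soft voting concrete, here is a small hand-computed sketch. The probability table is made up for illustration (it is not output from the classifiers above); it shows a case where one very confident classifier overturns the majority under soft voting.

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for ONE instance
# (rows: classifiers, columns: class 0 and class 1).
probas = np.array([
    [0.45, 0.55],   # weakly prefers class 1
    [0.40, 0.60],   # weakly prefers class 1
    [0.90, 0.10],   # strongly prefers class 0
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("hard voting ->", hard_winner)
print("soft voting ->", soft_winner, mean_proba)
```

Hard voting picks class 1 (two votes to one), while soft voting picks class 0 because the third classifier's confidence outweighs the two lukewarm votes.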


Bagging & Pasting 


• Another approach: use the same training algorithm for every predictor, but train each one on a different random subset of the training set.

• Bagging: sampling the dataset with replacement.

• Pasting: sampling the dataset without replacement.


• Final prediction: based on an aggregation function (typically the most frequent prediction for classification, the average for regression).
• Predictors can be trained in parallel, and predictions can be made in parallel: good scaling properties.
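
The sampling and aggregation steps can be sketched with plain NumPy. Everything here (the toy size m, the prediction table) is an assumption made for illustration; BaggingClassifier does the equivalent work internally.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10                      # size of a toy training set
indices = np.arange(m)

# Bagging: draw m indices WITH replacement (duplicates are possible).
bagging_sample = rng.choice(indices, size=m, replace=True)

# Pasting: draw a subset WITHOUT replacement (every index is unique).
pasting_sample = rng.choice(indices, size=6, replace=False)

# Aggregation for classification: most frequent prediction per instance.
# Rows are hypothetical predictors, columns are instances.
predictions = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
])
majority_vote = np.array([np.bincount(col).argmax() for col in predictions.T])

print(bagging_sample)
print(pasting_sample)
print(majority_vote)
```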


from sklearn.datasets import make_moons 

from sklearn.ensemble import BaggingClassifier 
from sklearn.metrics import accuracy_score 

from sklearn.tree import DecisionTreeClassifier 


# Train an ensemble of 500 Decision Tree classifiers,
# each using 100 training instances randomly sampled from the training set
# with replacement.


bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,  # set to False for pasting instead of bagging
    n_jobs=-1,
    random_state=42)


bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))


0.904


tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))


0.856 


from matplotlib.colors import ListedColormap 


def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5],
                           alpha=0.5, contour=True):
    # evaluate the classifier on a 100 x 100 grid covering the axes
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)

    custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap, linewidth=10)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58', '#4c4c7f', '#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)

    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)


plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)

plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)


#save_fig("decision_tree_without_and_with_bagging_plot")
plt.show()


[Figure: left panel "Decision Tree", right panel "Decision Trees with Bagging"]














Out of Bag Evaluation 


• Bagging: some instances may be sampled multiple times for a given predictor, others not at all. On average, only ~63% of the training instances are sampled for each predictor. The remaining ~37% are that predictor's "out-of-bag" (oob) instances.
• Set oob_score=True in Scikit-Learn to run an automatic oob evaluation after training.
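
The ~63% figure can be checked numerically. When m instances are drawn with replacement from a set of size m, the chance that a given instance is picked at least once is 1 - (1 - 1/m)^m, which approaches 1 - 1/e ≈ 0.632 as m grows. A minimal sketch (m and the seed are arbitrary choices):

```python
import numpy as np

m = 10_000
rng = np.random.default_rng(0)

# One bootstrap sample: m draws with replacement from m instances.
sample = rng.integers(0, m, size=m)
fraction_sampled = len(np.unique(sample)) / m

# Closed-form expectation and its large-m limit.
expected = 1 - (1 - 1 / m) ** m
limit = 1 - np.exp(-1)

print(fraction_sampled, expected, limit)
```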


# oob_score_: estimate of the accuracy the ensemble is likely to reach on the test set.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    n_jobs=-1,
    oob_score=True
    )
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_


0.89866666666666661


# did oob score do a good job?
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)


0.90400000000000003


bag_clf.oob_decision_function_


Random Patches - Random Subspaces


• BaggingClassifier also supports feature sampling. Params: max_features and bootstrap_features.

• Very useful when handling high-dimensional datasets.

• "Random patches": sampling both features & instances.

• "Random subspaces": sampling features & keeping all instances.
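
As a rough sketch of how the two variants map onto BaggingClassifier parameters: the synthetic dataset from make_classification and the specific fractions below are assumptions chosen for illustration, not values from these notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 20-feature dataset, used only for this illustration.
X_hd, y_hd = make_classification(n_samples=300, n_features=20, random_state=42)

# Random subspaces: keep all instances, sample features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

# Random patches: sample both instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42)

subspaces_clf.fit(X_hd, y_hd)
patches_clf.fit(X_hd, y_hd)

# Feature indices seen by the first tree of each ensemble.
print(subspaces_clf.estimators_features_[0])
print(patches_clf.estimators_features_[0])
```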


Random Forests 


• RF = ensemble of Decision Trees, typically trained via bagging.
• RandomForestClassifier: designed for classification.
• RandomForestRegressor: designed for regression.


# Train an RF classifier with 500 trees limited to 16 leaf nodes each.
# splitter="random": tells each tree to search for the best feature among
# a random subset of features.


bag_clf = BaggingClassifier(
    DecisionTreeClassifier(
        splitter="random",
        max_leaf_nodes=16,
        random_state=42),
    n_estimators=500,
    max_samples=1.0,
    bootstrap=True,
    n_jobs=-1,
    random_state=42)


bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)


from sklearn.ensemble import RandomForestClassifier


rnd_clf = RandomForestClassifier(
    n_estimators=500,
    max_leaf_nodes=16,
    n_jobs=-1,
    random_state=42)


rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)


# almost identical predictions
np.sum(y_pred == y_pred_rf) / len(y_pred)


0.97599999999999998 


Feature importance 


• Important features are likely to appear closer to the root of the tree.
• Unimportant features are likely to appear closer to the leaves, if at all.
• Scikit finds the avg depth of feature appearance across all trees in an RF.


from sklearn.datasets import load_iris
iris = load_iris()


rnd_clf = RandomForestClassifier(
    n_estimators=500,
    n_jobs=-1,
    random_state=42)


rnd_clf.fit(iris["data"], iris["target"])


for name, importance in zip(
        iris["feature_names"],
        rnd_clf.feature_importances_):
    print(name, "=", importance)


sepal length (cm) = 0.112492250999
sepal width (cm) = 0.0231192882825
petal length (cm) = 0.441030464364
petal width (cm) = 0.423357996355


rnd_clf.feature_importances_


array([ 0.11249225, 0.02311929, 0.44103046, 0.423358 ])


plt.figure(figsize=(6, 4))


# overlay the decision boundaries of 15 trees, each trained on a bootstrap sample
for i in range(15):
    tree_clf = DecisionTreeClassifier(
        max_leaf_nodes=16,
        random_state=42+i)

    indices_with_replacement = rnd.randint(
        0,
        len(X_train),
        len(X_train))

    tree_clf.fit(
        X[indices_with_replacement],
        y[indices_with_replacement])

    plot_decision_boundary(
        tree_clf, X, y,
        axes=[-1.5, 2.5, -1, 1.5],
        alpha=0.02,
        contour=False)


plt.show()


Boosting - AdaBoost 


• One strategy: pay more attention to the training instances that the predecessor underfitted. This forces new predictors to concentrate more and more on the "hard cases".
• Disadvantage: each predictor depends on the previous one (sequential training), so the algorithm cannot be parallelized. Not great for scaling.
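
One round of the instance-weight update can be sketched by hand. This follows the usual textbook AdaBoost formulation (weighted error rate r, predictor weight alpha = learning_rate * log((1 - r) / r)), which is an assumption here: the SVC demo further down in these notes uses a simpler rule that multiplies the weights of misclassified instances by (1 + learning_rate). The labels and predictions are made up.

```python
import numpy as np

# Hypothetical labels and predictions of the first predictor (two mistakes).
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1, 1, 1])
learning_rate = 1.0

m = len(y_true)
weights = np.full(m, 1 / m)          # start with uniform instance weights
miss = y_pred != y_true

# Weighted error rate of this predictor.
r = weights[miss].sum() / weights.sum()

# Predictor weight: the more accurate the predictor, the larger alpha.
alpha = learning_rate * np.log((1 - r) / r)

# Boost the weights of the misclassified instances, then renormalize.
weights[miss] *= np.exp(alpha)
weights /= weights.sum()

print(r, alpha)
print(weights)
```

After the update, the two misclassified instances carry half of the total weight, so the next predictor is pushed to get them right.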


# Plot decision boundaries of five consecutive predictors on the moons dataset


m = len(X_train)


plt.figure(figsize=(11, 4))
for subplot, learning_rate in ((121, 1), (122, 0.5)):
    sample_weights = np.ones(m)
    for i in range(5):
        plt.subplot(subplot)

        svm_clf = SVC(
            kernel="rbf",
            C=0.05)

        svm_clf.fit(
            X_train, y_train,
            sample_weight=sample_weights)

        y_pred = svm_clf.predict(
            X_train)

        # boost the weights of the misclassified instances
        sample_weights[y_pred != y_train] *= (1 + learning_rate)

        plot_decision_boundary(
            svm_clf,
            X, y,
            alpha=0.2)

        plt.title("learning_rate = {}".format(learning_rate),
                  fontsize=16)

plt.subplot(121)
plt.text(-0.7, -0.65, "1", fontsize=14)
plt.text(-0.6, -0.10, "2", fontsize=14)
plt.text(-0.5, 0.10, "3", fontsize=14)
plt.text(-0.4, 0.55, "4", fontsize=14)
plt.text(-0.3, 0.90, "5", fontsize=14)
#save_fig("boosting_plot")
plt.show()


# left: 1st clf gets many wrong, so those instances get boosted weights for the 2nd clf.
# right: same sequence, but learning rate cut in half.


[Figure: decision boundaries of the five consecutive predictors; left panel learning_rate = 1, right panel learning_rate = 0.5]


# train AdaBoost classifier on 200 decision stumps (DS)
# DS = decision tree with max_depth=1


from sklearn.ensemble import AdaBoostClassifier


ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42
    )
ada_clf.fit(X_train, y_train)
plot_decision_boundary(ada_clf, X, y)
plt.show()


Boosting - Gradient Boosting


• Similar to AdaBoost (each new predictor corrects its predecessors in the ensemble). Instead of tweaking instance weights on each iteration, GB fits the new predictor to the residual errors of the previous predictor.


from sklearn.tree import DecisionTreeRegressor


# training set: a noisy quadratic function
rnd.seed(42)
X = rnd.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * rnd.randn(100)


# train Regressor
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)


# now train 2nd Regressor using errors made by 1st one.
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)


# now train 3rd Regressor using errors made by 2nd one.
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)


X_new = np.array([[0.8]])


# now have ensemble w/ three trees.
y_pred = sum(tree.predict(X_new) for tree in (
    tree_reg1, tree_reg2, tree_reg3))
print(y_pred)


[ 0.75026781]


def plot_predictions(
        regressors, X, y, axes,
        label=None,
        style="r-",
        data_style="b.",
        data_label=None):

    x1 = np.linspace(axes[0], axes[1], 500)

    # ensemble prediction = sum of the individual regressors' predictions
    y_pred = sum(
        regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)

    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center", fontsize=16)
    plt.axis(axes)


plt.figure(figsize=(11,11))


plt.subplot(321)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8],
    label="$h_1(x_1)$", style="g-", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Residuals and tree predictions", fontsize=16)


plt.subplot(322)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.1, 0.8],
    label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$", fontsize=16, rotation=0)
plt.title("Ensemble predictions", fontsize=16)


plt.subplot(323)
plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.5, 0.5],
    label="$h_2(x_1)$", style="g-", data_style="k+", data_label="Residuals")
plt.ylabel("$y - h_1(x_1)$", fontsize=16)


plt.subplot(324)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.1, 0.8],
    label="$h(x_1) = h_1(x_1) + h_2(x_1)$")
plt.ylabel("$y$", fontsize=16, rotation=0)


plt.subplot(325)
plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.5, 0.5],
    label="$h_3(x_1)$", style="g-", data_style="k+")
plt.ylabel("$y - h_1(x_1) - h_2(x_1)$", fontsize=16)
plt.xlabel("$x_1$", fontsize=16)


plt.subplot(326)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y, axes=[-0.5, 0.5, -0.1, 0.8],
    label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$", fontsize=16)
plt.ylabel("$y$", fontsize=16, rotation=0)


#save_fig("gradient_boosting_plot")
plt.show()


# 1st row: ensemble = only one tree: predictions match 1st tree.
# 2nd row: new tree trained on residual errors of 1st tree.
# 3rd row: another tree trained on residual errors of 2nd tree.
# result: ensemble predictions get better as trees are added.


• The learning_rate param controls the contribution of each tree. Low values (ex: 0.1) = need more trees in the ensemble to fit the training set, but predictions usually generalize better. (This is called shrinkage.)


# two GBRT ensembles trained with low learning rate
from sklearn.ensemble import GradientBoostingRegressor


gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=3,
    learning_rate=0.1,
    random_state=42)


gbrt.fit(X, y)


gbrt_slow = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=200,
    learning_rate=0.1,
    random_state=42)


gbrt_slow.fit(X, y)


plt.figure(figsize=(11,4))


plt.subplot(121)
plot_predictions(
    [gbrt], X, y,
    axes=[-0.5, 0.5, -0.1, 0.8],
    label="Ensemble predictions")
plt.title("learning_rate={}, n_estimators={}".format(
    gbrt.learning_rate, gbrt.n_estimators), fontsize=14)


plt.subplot(122)
plot_predictions(
    [gbrt_slow], X, y,
    axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("learning_rate={}, n_estimators={}".format(
    gbrt_slow.learning_rate, gbrt_slow.n_estimators), fontsize=14)


#save_fig("gbrt_learning_rate_plot")
plt.show()


# left: not enough trees (underfits)
# right: too many trees (overfits)


[Figure: left panel learning_rate=0.1, n_estimators=3; right panel learning_rate=0.1, n_estimators=200]


• To find the optimal number of trees, use early stopping.
• staged_predict() method: returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.).


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


X_train, X_val, y_train, y_val = train_test_split(X, y)

# train GBRT regressor with 120 trees
gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=120,
    learning_rate=0.1,
    random_state=42)


gbrt.fit(X_train, y_train)


# measure MSE validation error at each stage
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
errors


[0.05877146809545241, 
0.050146609664278821, 
0.042693525239940654, 
0.036758764317358611, 
0.032342621749728441, 




.028407668512271105, 
.024897554253370889, 
.022344405311247584, 
.019535997367701449, 
.017423553892941333, 
.015298227412102105, 
.013614891608372095, 
.01241865401978786, 
.01114950733723946, 
.010131360091843384, 
.0091854704682465919, 
.0085684302891776056, 
.0078525358395017328, 
.0072105819722258777, 
.0067708705683962693, 
.0062415649764643415, 
.0058360573276457243, 
.0053862983457847987, 
.0051345071507873903, 
.0048692096567381805, 
.0045993749990593299, 
.0043550054844811968, 
.0041542481413648245, 
.0039794595160053785, 
.0038058301746231277, 
.0036528925611761264, 
.0035903310836105469, 
.0035078898256137104, 
.0034145667924260869, 
.0033091498103360911, 
.0032216349333429491, 
.0031684358902285465, 
.0031067035318094903, 
.0030811367114601672, 
.0030602631146299077, 
.003000040093686018, 
.0029246869254349805, 
.0028559321605494477, 
.0028308419421558683, 
.0028218777360194264, 




.0027941065824977074, 
.0027733228935542496, 
.0027805517665357811, 
.0027523772234700978, 
.0027297064654860348, 
.0027248578787871292, 
.0027111390401517179, 
.0027041926119007326, 
.0026930464329994377, 
.0027047076934144398, 
.0027194180251317295, 
.0027010027055809748, 
.0026976053707465464, 
.0026946405089738347, 
.0026713744909731395, 
.0026633491003786457, 
.0026694977341077202, 
.0026594592750579836, 
.0026425819418378605, 
.0026524409142755744, 
.0026418897165154491, 
.0026483360802177103, 
.0026456393608631189, 
.0026465080389023671, 
.0026396693211148074, 
.002649273120700455, 
.002643721514468783, 
.0026463988198929221, 
.0026333618213948747, 
.0026314011519099879, 
.0026349113355268257, 
.0026387528659342825, 
.0026345585421650142, 
.0026355886319374901, 
.0026310345391991532, 
.0026519658939712061, 
.0026467700098620557,
.00264498239665715, 

.0026475491456891486, 
.0026474836942911913, 




.0026530458155365681, 
.0026478335004093052, 
.0026564768881028435, 
.0026574608795571115, 
.0026537575609276061, 
.0026559108292476983, 
.0026528848367343987, 
.0026533895549644779, 
.0026520896622857252, 
.0026416985817433059, 
.0026497886163651938, 
.0026430582537166087, 
.0026548742317473117, 
.002660275592603878, 

.0026582571161537366, 
.0026570823709750535, 
.0026557538081706522, 
.002675470519360824, 

.0026762761989050578, 
.0026742086578626454, 
.0026957941482744232, 
.0026964801899977998, 
.0026939578807501376, 
.0026959742963617757, 
.0026949319702616616, 
.0026988916344244736, 
.0027169473218451121, 
.0027148017926961689, 
.0027192710134859655, 
.0027358435370699618, 
.0027346474658663323, 
.0027351047440069571, 
.0027459941366245631, 
.0027441324932851491, 
.002756368378237764] 


# train another GBRT ensemble using the optimal number of trees


best_n_estimators = np.argmin(errors)
min_error = errors[best_n_estimators]


gbrt_best = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=best_n_estimators,
    learning_rate=0.1,
    random_state=42)


gbrt_best.fit(X_train, y_train)


GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=79, presort='auto',
             random_state=42, subsample=1.0, verbose=0, warm_start=False)


plt.figure(figsize=(11, 4))


plt.subplot(121)
plt.plot(errors, "b.-")
plt.plot([best_n_estimators, best_n_estimators], [0, min_error], "k--")
plt.plot([0, 120], [min_error, min_error], "k--")
plt.plot(best_n_estimators, min_error, "ko")
plt.text(best_n_estimators, min_error*1.2, "Minimum", ha="center", fontsize=14)
plt.axis([0, 120, 0, 0.01])
plt.xlabel("Number of trees")
plt.title("Validation error", fontsize=14)


plt.subplot(122)
plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8])
plt.title("Best model (55 trees)", fontsize=14)


#save_fig("early_stopping_gbrt_plot")
plt.show()


• Another method: actually stop training early, instead of training a large number of trees first and then looking back for the optimal number.
• Implement via warm_start=True (tells Scikit-Learn to keep existing trees when fit() is called, allowing incremental training).


gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=1,
    learning_rate=0.1,
    random_state=42,
    warm_start=True)


min_val_error = float("inf")
error_going_up = 0


# stop training when the validation error doesn't improve for 5 iterations in a row
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)

    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping


print(gbrt.n_estimators)


59 


Stacking 


• Instead of using a voting function to aggregate the outputs of an ensemble's predictors, train a model to do the aggregation ("blending").
• Blender training: common approach = use a holdout set.
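
The notes leave the implementation as a todo, so what follows is only one possible sketch of hold-out blending, not the author's version. The dataset, the split sizes, the choice of first-layer predictors, and the helper name layer1_outputs are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_all, y_all = make_moons(n_samples=1000, noise=0.30, random_state=42)

# Three disjoint sets: layer-1 training set, hold-out set for the blender, test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42)
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Layer 1: train the individual predictors on the first subset only.
predictors = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    SVC(probability=True, random_state=42),
    LogisticRegression(random_state=42),
]
for clf in predictors:
    clf.fit(X_tr, y_tr)

def layer1_outputs(X):
    """Stack each predictor's class-1 probability as one blender input feature."""
    return np.column_stack([clf.predict_proba(X)[:, 1] for clf in predictors])

# Layer 2: the blender learns from predictions made on instances the
# first-layer predictors never saw during training.
blender = LogisticRegression(random_state=42)
blender.fit(layer1_outputs(X_hold), y_hold)

y_pred_blend = blender.predict(layer1_outputs(X_te))
print(accuracy_score(y_te, y_pred_blend))
```

Training the blender on held-out predictions keeps it from simply learning to trust whichever first-layer predictor overfits the most.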
# todo: stacking implementation


Intro


• Dimensionality reduction is lossy. It may speed up training but can degrade result quality. It also makes pipelines more complex. Try using the original data before considering dimensionality reduction.

• Very useful for visualization (2D, 3D representations are more intuitive).


• Two main approaches: projection, manifold learning.


• Three most popular techniques: PCA, Kernel PCA, LLE.


Curse of Dimensionality


• Many things behave differently in high-D space.
1) Most points in a high-D hypercube will be very close to a border.


2) Distances between random points are much greater, so high-D datasets are very likely to be sparse.


Average distance between two random points in a unit hypercube:
• In 2D: ~0.52
• In 3D: ~0.66
• In 1,000,000D: ~408 ≈ sqrt(1000000/6)
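
These averages can be checked with a quick Monte Carlo estimate. The helper name mean_pair_distance, the number of pairs, and the seed are arbitrary choices for this sketch; 100 dimensions stands in for the million-dimensional case to keep memory use small, and sqrt(d/6) is only a good approximation when d is large.

```python
import numpy as np

def mean_pair_distance(n_dims, n_pairs=20_000, seed=0):
    """Estimate the average distance between two random points in a unit hypercube."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

for d in (2, 3, 100):
    # second column: Monte Carlo estimate; third: large-d approximation sqrt(d/6)
    print(d, mean_pair_distance(d), np.sqrt(d / 6))
```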


Approaches: Projection 


• In most real datasets, training instances are not spread uniformly across all dimensions: they lie within (or close to) a much lower-dimensional subspace, so much of what can be learned is found in that low-D subspace.


import numpy as np 
import numpy.random as rnd 


# build a 3D dataset 


rnd.seed(4) 

m = 60 

w1, w2 = 0.1, 0.3
noise = 0.1


angles = rnd.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))

X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * rnd.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * rnd.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * rnd.randn(m)


# mean-normalize the data
X = X - X.mean(axis=0)


# apply PCA to reduce to 2D
from sklearn.decomposition import PCA


pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)


# recover 3D points projected on 2D plane
X2D_inv = pca.inverse_transform(X2D)


# Utility to draw 3D arrows 
from matplotlib.patches import FancyArrowPatch 
from mpl_toolkits.mplot3d import proj3d 


class Arrow3D(FancyArrowPatch):
    def __init__(self, xs, ys, zs, *args, **kwargs):
        FancyArrowPatch.__init__(self, (0,0), (0,0), *args, **kwargs)
        self._verts3d = xs, ys, zs

    def draw(self, renderer):
        xs3d, ys3d, zs3d = self._verts3d
        xs, ys, zs = proj3d.proj_transform(xs3d, ys3d, zs3d, renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        FancyArrowPatch.draw(self, renderer)


# express plane as function of x,y
axes = [-1.8, 1.8, -1.3, 1.3, -1.0, 1.0]


x1s = np.linspace(axes[0], axes[1], 10)
x2s = np.linspace(axes[2], axes[3], 10)
x1, x2 = np.meshgrid(x1s, x2s)


C = pca.components_
R = C.T.dot(C)
z = (R[0, 2] * x1 + R[1, 2] * x2) / (1 - R[2, 2])


# plot 3D dataset, plane & projections


import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')


X3D_above = X[X[:, 2] > X2D_inv[:, 2]]
X3D_below = X[X[:, 2] <= X2D_inv[:, 2]]


ax.plot(X3D_below[:, 0], X3D_below[:, 1], X3D_below[:, 2], "bo",
        alpha=0.5)


ax.plot_surface(x1, x2, z, alpha=0.2, color="k")
np.linalg.norm(C, axis=0)
ax.add_artist(Arrow3D([0, C[0, 0]],[0, C[0, 1]],[0, C[0, 2]],
    mutation_scale=15, lw=1, arrowstyle="-|>", color="k"))
ax.add_artist(Arrow3D([0, C[1, 0]],[0, C[1, 1]],[0, C[1, 2]],
    mutation_scale=15, lw=1, arrowstyle="-|>", color="k"))


ax.plot([0], [0], [0], "k.")


# draw a segment from each point to its projection on the plane
for i in range(m):
    if X[i, 2] > X2D_inv[i, 2]:
        ax.plot([X[i][0], X2D_inv[i][0]], [X[i][1], X2D_inv[i][1]],
                [X[i][2], X2D_inv[i][2]], "k-")
    else:
        ax.plot([X[i][0], X2D_inv[i][0]], [X[i][1], X2D_inv[i][1]],
                [X[i][2], X2D_inv[i][2]], "k-", color="#505050")


ax.plot(X2D_inv[:, 0], X2D_inv[:, 1], X2D_inv[:, 2], "k+")
ax.plot(X2D_inv[:, 0], X2D_inv[:, 1], X2D_inv[:, 2], "k.")
ax.plot(X3D_above[:, 0], X3D_above[:, 1], X3D_above[:, 2], "bo")
ax.set_xlabel("$x_1$", fontsize=18)
ax.set_ylabel("$x_2$", fontsize=18)
ax.set_zlabel("$x_3$", fontsize=18)
ax.set_xlim(axes[0:2])
ax.set_ylim(axes[2:4])
ax.set_zlim(axes[4:6])


#save_fig("dataset_3d_plot")
plt.show()


# 2D projection equivalent:
fig = plt.figure()
ax = fig.add_subplot(111, aspect='equal')
ax.plot(X2D[:, 0], X2D[:, 1], "k+")
ax.plot(X2D[:, 0], X2D[:, 1], "k.")
ax.plot([0], [0], "ko")
ax.arrow(0, 0, 0, 1, head_width=0.05, length_includes_head=True,
         head_length=0.1, fc='k', ec='k')
ax.arrow(0, 0, 1, 0, head_width=0.05, length_includes_head=True,
         head_length=0.1, fc='k', ec='k')
ax.set_xlabel("$z_1$", fontsize=18)
ax.set_ylabel("$z_2$", fontsize=18, rotation=0)
ax.axis([-1.5, 1.3, -1.2, 1.2])
ax.grid(True)
plt.show()


Approaches: Manifolds


• Manifold = a low-D shape that can be bent/twisted in a higher-D space.


• ex: "Swiss roll" problem


# Swiss roll visualization:


from sklearn.datasets import make_swiss_roll
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)


# axis limits for x1, x2, x3
axes = [-11.5, 14, -2, 23, -12, 15]


fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')


ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap=plt.cm.hot)
ax.view_init(10, -70)


ax.set_xlabel("$x_1$", fontsize=18)
ax.set_ylabel("$x_2$", fontsize=18)
ax.set_zlabel("$x_3$", fontsize=18)


ax.set_xlim(axes[0:2])
ax.set_ylim(axes[2:4])
ax.set_zlim(axes[4:6])


#save_fig("swiss_roll_plot")
plt.show()








# "squashed" swiss roll visualization: 
plt.figure(figsize=(11, 4))


plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=t, cmap=plt.cm.hot)
plt.axis(axes[:4])
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$x_2$", fontsize=18, rotation=0)
plt.grid(True)


plt.subplot(122)
plt.scatter(t, X[:, 1], c=t, cmap=plt.cm.hot)
plt.axis([4, 15, axes[2], axes[3]])
plt.xlabel("$z_1$", fontsize=18)
plt.grid(True)


#save_fig("squished_swiss_roll_plot")
plt.show()




















from matplotlib import gridspec


axes = [-11.5, 14, -2, 23, -12, 15]


x2s = np.linspace(axes[2], axes[3], 10)
x3s = np.linspace(axes[4], axes[5], 10)
x2, x3 = np.meshgrid(x2s, x3s)


fig = plt.figure(figsize=(6, 5))
ax = plt.subplot(111, projection='3d')


positive_class = X[:, 0] > 5
X_pos = X[positive_class]
X_neg = X[~positive_class]


ax.view_init(10, -70)
ax.plot(X_neg[:, 0], X_neg[:, 1], X_neg[:, 2], "y^")
ax.plot_wireframe(5, x2, x3, alpha=0.5)
ax.plot(X_pos[:, 0], X_pos[:, 1], X_pos[:, 2], "gs")
ax.set_xlabel("$x_1$", fontsize=18)
ax.set_ylabel("$x_2$", fontsize=18)
ax.set_zlabel("$x_3$", fontsize=18)
ax.set_xlim(axes[0:2])
ax.set_ylim(axes[2:4])
ax.set_zlim(axes[4:6])


#save_fig("manifold_decision_boundary_plot1")
plt.show()


fig = plt.figure(figsize=(5, 4))
ax = plt.subplot(111)


plt.plot(t[positive_class], X[positive_class, 1], "gs")
plt.plot(t[~positive_class], X[~positive_class, 1], "y^")
plt.axis([4, 15, axes[2], axes[3]])
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18, rotation=0)
plt.grid(True)


#save_fig("manifold_decision_boundary_plot2")
plt.show()


fig = plt.figure(figsize=(6, 5))
ax = plt.subplot(111, projection='3d')


positive_class = 2 * (t[:] - 4) > X[:, 1]
X_pos = X[positive_class]
X_neg = X[~positive_class]


ax.view_init(10, -70)
ax.plot(X_neg[:, 0], X_neg[:, 1], X_neg[:, 2], "y^")
ax.plot(X_pos[:, 0], X_pos[:, 1], X_pos[:, 2], "gs")
ax.set_xlabel("$x_1$", fontsize=18)
ax.set_ylabel("$x_2$", fontsize=18)
ax.set_zlabel("$x_3$", fontsize=18)
ax.set_xlim(axes[0:2])
ax.set_ylim(axes[2:4])
ax.set_zlim(axes[4:6])


#save_fig("manifold_decision_boundary_plot3")
plt.show()


fig = plt.figure(figsize=(5, 4))
ax = plt.subplot(111)


plt.plot(t[positive_class], X[positive_class, 1], "gs")
plt.plot(t[~positive_class], X[~positive_class, 1], "y^")
plt.plot([4, 15], [0, 22], "b-", linewidth=2)
plt.axis([4, 15, axes[2], axes[3]])
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18, rotation=0)
plt.grid(True)


#save_fig("manifold_decision_boundary_plot4")
plt.show()


# Lesson learned (below):
# Unrolling a dataset to a lower dimension doesn't necessarily lead to
# a simpler representation.


PCA (Principal Component Analysis)


• Most popular DR algorithm
• 1) Finds the hyperplane that lies closest to the data
• 2) Projects the data onto it


Preserving Variance


• Below: simple 2D dataset projected onto 3 different axes.
• Projection on the solid line preserves the maximum variance. (Therefore less likely to lose information.)


angle = np.pi / 5
stretch = 5
m = 200

rnd.seed(3)  # rnd is numpy.random
X = rnd.randn(m, 2) / 10
X = X.dot(np.array([[stretch, 0], [0, 1]]))  # stretch
X = X.dot([[np.cos(angle), np.sin(angle)],
           [-np.sin(angle), np.cos(angle)]])  # rotate

u1 = np.array([np.cos(angle), np.sin(angle)])
u2 = np.array([np.cos(angle - 2 * np.pi/6), np.sin(angle - 2 * np.pi/6)])
u3 = np.array([np.cos(angle - np.pi/2), np.sin(angle - np.pi/2)])

X_proj1 = X.dot(u1.reshape(-1, 1))
X_proj2 = X.dot(u2.reshape(-1, 1))
X_proj3 = X.dot(u3.reshape(-1, 1))

# left panel: the dataset and the three candidate axes
plt.figure(figsize=(8,4))
plt.subplot2grid((3,2), (0, 0), rowspan=3)
plt.plot([-1.4, 1.4], [-1.4*u1[1]/u1[0], 1.4*u1[1]/u1[0]], "k-", linewidth=1)
plt.plot([-1.4, 1.4], [-1.4*u2[1]/u2[0], 1.4*u2[1]/u2[0]], "k--", linewidth=1)
plt.plot([-1.4, 1.4], [-1.4*u3[1]/u3[0], 1.4*u3[1]/u3[0]], "k:", linewidth=2)
plt.plot(X[:, 0], X[:, 1], "bo", alpha=0.5)
plt.axis([-1.4, 1.4, -1.4, 1.4])
plt.arrow(0, 0, u1[0], u1[1], head_width=0.1, linewidth=5,
          length_includes_head=True, head_length=0.1, fc='k', ec='k')
plt.arrow(0, 0, u3[0], u3[1], head_width=0.1, linewidth=5,
          length_includes_head=True, head_length=0.1, fc='k', ec='k')
plt.text(u1[0] + 0.1, u1[1] - 0.05, r"$\mathbf{c_1}$", fontsize=22)
plt.text(u3[0] + 0.1, u3[1], r"$\mathbf{c_2}$", fontsize=22)
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$x_2$", fontsize=18, rotation=0)
plt.grid(True)

# right panels: the dataset projected onto each axis
plt.subplot2grid((3,2), (0, 1))
plt.plot([-2, 2], [0, 0], "k-", linewidth=1)
plt.plot(X_proj1[:, 0], np.zeros(m), "bo", alpha=0.3)
plt.gca().get_yaxis().set_ticks([])
plt.gca().get_xaxis().set_ticklabels([])
plt.axis([-2, 2, -1, 1])
plt.grid(True)

plt.subplot2grid((3,2), (1, 1))
plt.plot([-2, 2], [0, 0], "k--", linewidth=1)
plt.plot(X_proj2[:, 0], np.zeros(m), "bo", alpha=0.3)
plt.gca().get_yaxis().set_ticks([])
plt.gca().get_xaxis().set_ticklabels([])
plt.axis([-2, 2, -1, 1])
plt.grid(True)

plt.subplot2grid((3,2), (2, 1))
plt.plot([-2, 2], [0, 0], "k:", linewidth=2)
plt.plot(X_proj3[:, 0], np.zeros(m), "bo", alpha=0.3)
plt.gca().get_yaxis().set_ticks([])
plt.axis([-2, 2, -1, 1])
plt.xlabel("$z_1$", fontsize=18)
plt.grid(True)

#save_fig("pca_best_projection")
plt.show()
Principal Components


• PCA finds the axis that accounts for the largest amount of variance in the dataset.
• It then finds a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance.
• For a higher-dimensional dataset, PCA also finds a third axis, and so on, up to the number of dimensions in the dataset.
• The unit vector that defines each axis is called a principal component (PC).

• PCs are found using Singular Value Decomposition (SVD), a matrix factorization technique.
• SVD decomposes the training set matrix X into the product of three matrices, U · Σ · V^T; the columns of V are the principal components.
• Note: PCA assumes the data is centered around the origin. Scikit-Learn's PCA centers the data for you.


X_centered = X - X.mean(axis=0) 
U,s,V = np.linalg.svd(X_centered) 


c1, c2 = V.T[:, 0], V.T[:, 1]
print(c1,c2) 


[-0.79644131 -0.60471583] [-0.60471583 0.79644131] 


Projecting Training Data Down to d Dimensions 


• Done by computing the dot product of the centered training data X with the matrix Wd, whose columns are the first d principal components.


W2 = V.T[:, :2]
X2D = X_centered.dot(W2)
print(X2D) 
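
The projection step can be checked on synthetic data. The sketch below is only an illustration: the 3D dataset, the variable names and the choice d = 2 are assumptions, not part of the original notebook. It also shows that the variance along each projected axis equals the squared singular value divided by m - 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 points in 3D, stretched differently along each axis.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

X_centered = X - X.mean(axis=0)
# full_matrices=False returns the economy-size factors: U is (m, n), Vt is (n, n).
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 2
W_d = Vt.T[:, :d]          # columns are the first d principal components
X_dD = X_centered @ W_d    # projected data, shape (200, d)

# Variance along each projected axis equals s_i**2 / (m - 1).
proj_var = X_dD.var(axis=0, ddof=1)
svd_var = s[:d] ** 2 / (len(X) - 1)
print(proj_var, svd_var)
```

Note that np.linalg.svd returns the transposed matrix (called V in the cell above, Vt here), which is why the components are read from its transpose.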
Scikit PCA


• Uses SVD decomposition as before.
• You can access each PC through the components_ attribute (one PC per row).


from sklearn.decomposition import PCA 
pca = PCA(n_components = 2) 

X2D = pca.fit_transform(X) 
print(pca.components_[0]) 


print(pca.components_.T[:,0]) 


[-0.79644131 -0.60471583]
[-0.79644131 -0.60471583]


Explained Variance Ratio 


• Very useful metric: the proportion of the dataset's variance that lies along the axis of each principal component.


# fraction of the dataset's variance that lies along each PC


print(pca.explained_variance_ratio_) 


[ 0.95369864 0.04630136] 


Choosing Right #Dimensions 


• No need to choose the number of dimensions arbitrarily. Instead, pick the smallest d whose components cumulatively account for a sufficient share of the variance, e.g. 95%.


# find minimum d to preserve 95% of training set variance
pca = PCA() 
pca.fit(X) 
cumsum = np.cumsum(pca.explained_variance_ratio_) 
d = np.argmax(cumsum >= 0.95) + 1 
print(d) 
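
The same selection can be reproduced with NumPy alone, which makes the role of the singular values explicit. This is a minimal sketch on made-up data; the feature scales and the names below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 10 features whose standard deviations decay geometrically.
scales = 0.5 ** np.arange(10)
X = rng.normal(size=(500, 10)) * scales

X_centered = X - X.mean(axis=0)
s = np.linalg.svd(X_centered, compute_uv=False)   # singular values only

# Each component's share of the total variance.
explained_variance_ratio = s**2 / np.sum(s**2)
cumsum = np.cumsum(explained_variance_ratio)

# argmax returns the index of the first True, so add 1 to turn it into a count.
d = int(np.argmax(cumsum >= 0.95)) + 1
print(d, cumsum[:d])
```

The "+ 1" matters: without it, d would be the zero-based index of the last component needed rather than the number of components.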


PCA for Compression 


• Example: applying PCA to MNIST while preserving 95% of the variance leaves about 150 features (154 in the run below), down from the original 28x28 = 784.


# MNIST compression:
from sklearn.model_selection import train_test_split

# fetch_mldata has been removed from recent scikit-learn releases,
# so the data is loaded from a local .mat file instead.
#from sklearn.datasets import fetch_mldata
#mnist = fetch_mldata('MNIST original')
mnist_path = "./mnist-original.mat"

from scipy.io import loadmat
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}

X, y = mnist["data"], mnist["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

X = X_train
pca = PCA()
pca.fit(X)

d = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.95) + 1
d


154 


pca = PCA(n_components=0.95) 
X_reduced = pca.fit_transform(X) 
pca.n_components_


154


# did you hit your 95% minimum?
np.sum(pca.explained_variance_ratio_)


0.9503623084769206


# use inverse_transform to decompress back to 784 dimensions

X_mnist = X_train

pca = PCA(n_components=154)

X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)


import matplotlib 
import matplotlib.pyplot as plt 


def plot_digits(instances, images_per_row=5, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    # pad the last row with blank images
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=matplotlib.cm.binary, **options)
    plt.axis("off")


plt.figure(figsize=(7, 4))
plt.subplot(121)
plot_digits(X_mnist[::2100])
plt.title("Original", fontsize=16)
plt.subplot(122)
plot_digits(X_mnist_recovered[::2100])
plt.title("Compressed", fontsize=16)
#save_fig("mnist_compression_plot")
plt.show()


[Figure: MNIST digits. Left panel: Original. Right panel: Compressed (after PCA compression and reconstruction)]
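
What inverse_transform does for plain PCA can be written out by hand: multiply the reduced data by the transposed component matrix and add the mean back. The sketch below uses a small made-up dataset instead of MNIST (an assumption, so that it runs without any download) and checks that the total squared reconstruction error equals the sum of the squared singular values that were discarded.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 300 samples, 20 features, most variance in 5 directions.
latent = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 20))
X = latent + 0.05 * rng.normal(size=(300, 20))

mean = X.mean(axis=0)
X_centered = X - mean
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 5
W_d = Vt.T[:, :d]
X_reduced = X_centered @ W_d              # compress: 20 -> 5 features
X_recovered = X_reduced @ W_d.T + mean    # decompress: 5 -> 20 features

# The squared reconstruction error is exactly the variance that was dropped.
sq_error = np.sum((X - X_recovered) ** 2)
discarded = np.sum(s[d:] ** 2)
print(sq_error, discarded)
```

This is why the explained variance ratio is a direct measure of how lossy the compression will be.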
Incremental PCA


• PCA normally requires the entire dataset to fit in memory for the SVD algorithm.
• Incremental PCA (IPCA) splits the dataset into mini-batches and feeds them to the algorithm one at a time.


# split MNIST into 100 minibatches using NumPy array_split()
# reduce MNIST down to 154 dimensions as before.
# note use of partial_fit() for each batch.

from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)

for X_batch in np.array_split(X_mnist, n_batches):
    print(".", end="")
    inc_pca.partial_fit(X_batch)

X_mnist_reduced_inc = inc_pca.transform(X_mnist)


# alternative: NumPy memmap class (use binary array on disk as if it was in memory)
filename = "my_mnist.data"
X_mm = np.memmap(
    filename, dtype='float32', mode='write', shape=X_mnist.shape)
X_mm[:] = X_mnist
del X_mm  # deleting the memmap object flushes the data to disk

X_mm = np.memmap(filename, dtype='float32', mode='readonly', shape=X_mnist.shape)

batch_size = len(X_mnist) // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)


IncrementalPCA(batch_size=525, copy=True, n_components=154, whiten=False)
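
The memmap round trip can be tried in isolation. The following is a minimal sketch under stated assumptions: a small random array stands in for MNIST, the file lives in a temporary directory, and the mode strings "w+" and "r" are used in place of the longer synonyms in the cell above.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)   # hypothetical stand-in for MNIST

tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "demo.data")       # hypothetical file name

# "w+" creates (or overwrites) the file; the shape must be given explicitly
# because the raw file stores no metadata.
X_mm = np.memmap(filename, dtype="float32", mode="w+", shape=X.shape)
X_mm[:] = X
X_mm.flush()
del X_mm

# "r" reopens the file read-only; data is paged in only when it is accessed.
X_ro = np.memmap(filename, dtype="float32", mode="r", shape=X.shape)

n_batches = 10
batch_rows = [len(batch) for batch in np.array_split(X_ro, n_batches)]
print(batch_rows)
```

Because the file carries no header, the dtype and shape passed when reopening must match the ones used when writing.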


rnd_pca = PCA(
    n_components=154,
    random_state=42,
    svd_solver="randomized")

X_reduced = rnd_pca.fit_transform(X_mnist)


import time

# note: as written, inc_pca and rnd_pca always use 154 components;
# only regular_pca follows the loop variable.
for n_components in (2, 10, 154):
    print("n_components =", n_components)
    regular_pca = PCA(
        n_components=n_components)
    inc_pca = IncrementalPCA(
        n_components=154,
        batch_size=500)
    rnd_pca = PCA(
        n_components=154,
        random_state=42,
        svd_solver="randomized")

    for pca in (regular_pca, inc_pca, rnd_pca):
        t1 = time.time()
        pca.fit(X_mnist)
        t2 = time.time()
        print(pca.__class__.__name__, t2 - t1, "seconds")


n_components = 2
PCA 1.308387279510498 seconds
IncrementalPCA 18.326093673706055 seconds
PCA 3.998342514038086 seconds
n_components = 10
PCA 1.4705824851989746 seconds
IncrementalPCA 16.598721742630005 seconds
PCA 4.156355619430542 seconds
n_components = 154
PCA 4.129154682159424 seconds
IncrementalPCA 16.597434043884277 seconds
PCA 4.0131142139434814 seconds


Randomized PCA 


• Stochastic algorithm that quickly finds an approximation of the first d principal components. Dramatically faster than full SVD when d is much smaller than the number of features n.


rnd_pca = PCA(n_components=154, svd_solver="randomized")

t1 = time.time()
X_reduced = rnd_pca.fit_transform(X_mnist)
t2 = time.time()

print(t2-t1, "seconds")


4.414088487625122 seconds 
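
To see why a randomized solver can be fast, here is a rough NumPy sketch of the general random-projection idea: sample the range of the data with a random matrix, then run an exact SVD on the much smaller projected matrix. The helper name, its parameters and the test data are all assumptions, and this is not claimed to be what Scikit-Learn does internally.

```python
import numpy as np

def randomized_pca(X, d, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the first d principal components of X (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X_centered = X - X.mean(axis=0)
    # 1) Sample the range of the data with a random test matrix.
    Q = rng.normal(size=(X.shape[1], d + n_oversamples))
    Q, _ = np.linalg.qr(X_centered @ Q)
    # 2) A few power iterations sharpen the estimate of the dominant subspace.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(X_centered.T @ Q)
        Q, _ = np.linalg.qr(X_centered @ Q)
    # 3) Exact SVD of the small projected matrix.
    B = Q.T @ X_centered
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:d], s[:d]

rng = np.random.default_rng(3)
# Hypothetical data: rank-6 signal in 50 dimensions plus a little noise.
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 50)) \
    + 0.01 * rng.normal(size=(400, 50))

components, s_approx = randomized_pca(X, d=6)
s_exact = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:6]
print(s_approx)
print(s_exact)
```

The expensive decomposition only ever touches a matrix with d + n_oversamples rows, which is where the savings come from when d is small.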


Kernel PCA 


• The kernel trick implicitly maps instances into a very high-dimensional feature space; it is what enables nonlinear classification and regression with SVMs. Applied to PCA, it makes complex nonlinear projections possible.
• Good at preserving clusters of instances after projection.


# Below: Swiss roll reduced to 2D using 3 techniques:
# 1) linear kernel (equiv to PCA)
# 2) RBF kernel
# 3) sigmoid kernel (logistic)

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, t = make_swiss_roll(
    n_samples=1000,
    noise=0.2,
    random_state=12)

lin_pca = KernelPCA(
    n_components=2,
    kernel="linear",
    fit_inverse_transform=True)

rbf_pca = KernelPCA(
    n_components=2,
    kernel="rbf",
    gamma=0.0433,
    fit_inverse_transform=True)

sig_pca = KernelPCA(
    n_components=2,
    kernel="sigmoid",
    gamma=0.001,
    coef0=1,
    fit_inverse_transform=True)

y = t > 6.9

plt.figure(figsize=(11, 4))

for subplot, pca, title in (
        (131, lin_pca, "Linear kernel"),
        (132, rbf_pca, r"RBF kernel, $\gamma=0.04$"),
        (133, sig_pca, r"Sigmoid kernel, $\gamma=10^{-3}, r=1$")):

    X_reduced = pca.fit_transform(X)
    if subplot == 132:
        X_reduced_rbf = X_reduced

    plt.subplot(subplot)
    #plt.plot(X_reduced[y, 0], X_reduced[y, 1], "gs")
    #plt.plot(X_reduced[~y, 0], X_reduced[~y, 1], "y^")
    plt.title(title, fontsize=14)
    plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
    plt.xlabel("$z_1$", fontsize=18)
    if subplot == 131:
        plt.ylabel("$z_2$", fontsize=18, rotation=0)
    plt.grid(True)

#save_fig("kernel_pca_plot")
plt.show()


[Figure: Swiss roll reduced to 2D with Kernel PCA. Panels: Linear kernel; RBF kernel, γ = 0.04; Sigmoid kernel, γ = 10^-3, r = 1]
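
The claim that a linear kernel is equivalent to ordinary PCA can be verified with a few lines of NumPy. The sketch below is a bare-bones kernel PCA under stated assumptions: the helper names, the random data and the gamma value are illustrative, and only the projection of the training points is implemented.

```python
import numpy as np

def kernel_pca(K, d):
    """Project training points onto the first d kernel principal components.

    K is the (m, m) kernel matrix of the training set (hypothetical helper).
    """
    m = K.shape[0]
    one_m = np.full((m, m), 1.0 / m)
    # Centering the kernel matrix centers the data in feature space.
    K_centered = K - one_m @ K - K @ one_m + one_m @ K @ one_m
    eigvals, eigvecs = np.linalg.eigh(K_centered)      # ascending order
    idx = np.argsort(eigvals)[::-1][:d]                # keep the d largest
    # Projection of point i on component j is sqrt(lambda_j) * v_j[i].
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

def rbf_kernel(X, gamma):
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.3])

Z_linear = kernel_pca(X @ X.T, d=2)                    # linear kernel
Z_rbf = kernel_pca(rbf_kernel(X, gamma=0.04), d=2)     # RBF kernel

# Reference: ordinary PCA projection via SVD.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
Z_pca = X_centered @ Vt.T[:, :2]
```

With the linear kernel the two projections agree up to the sign of each column, since eigenvectors are only defined up to sign.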
Selecting a Kernel & Hyperparameters


• Dimensionality reduction is often a preparation step for a supervised learning task.
• So you can use grid search to select the kernel & hyperparameters that give the best performance on that task.


from sklearn.model_selection import GridSearchCV 
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline 


clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())])

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)


# best kernel & params?
print(grid_search.best_params_)


{'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}


• Another (unsupervised) approach: select the kernel & hyperparameters that yield the lowest reconstruction error. Reconstruction is not as easy as with linear PCA.


rbf_pca = KernelPCA(
    n_components=2,
    kernel="rbf",
    gamma=0.0433,
    fit_inverse_transform=True)  # perform reconstruction

X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

# return reconstruction pre-image error


from sklearn.metrics import mean_squared_error 
mean_squared_error(X, X_preimage) 


32.786308795766082 


times_rpca = []
times_pca = []
sizes = [1000, 10000, 20000, 30000, 40000, 50000, 70000,
         100000, 200000, 500000]

for n_samples in sizes:
    X = rnd.randn(n_samples, 5)

    pca = PCA(
        n_components=2,
        random_state=12,
        svd_solver="randomized")

    t1 = time.time()
    pca.fit(X)
    t2 = time.time()
    times_rpca.append(t2 - t1)

    pca = PCA(n_components=2)

    t1 = time.time()
    pca.fit(X)
    t2 = time.time()
    times_pca.append(t2 - t1)

plt.plot(sizes, times_rpca, "b-o", label="RPCA")
plt.plot(sizes, times_pca, "r-s", label="PCA")
plt.xlabel("n_samples")
plt.ylabel("Training time")
plt.legend(loc="upper left")
plt.title("PCA and Randomized PCA time complexity ")
plt.show()


[Figure: PCA and Randomized PCA time complexity; training time vs. n_samples]
LLE (Locally Linear Embedding)


• Powerful nonlinear dimensionality reduction tool.
• A Manifold Learning technique; it doesn't rely on projections.
• LLE measures how each instance relates linearly to its closest neighbors, then looks for a low-dimensional representation where these local relationships are best preserved.


# Use LLE to unroll a Swiss Roll.

from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(
    n_samples=1000,
    noise=0.2,
    random_state=41)

lle = LocallyLinearEmbedding(
    n_neighbors=10,
    n_components=2,
    random_state=12)

X_reduced = lle.fit_transform(X)

plt.title("Unrolled swiss roll using LLE", fontsize=14)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18)
plt.axis([-0.065, 0.055, -0.1, 0.12])
plt.grid(True)

#save_fig("lle_unrolling_plot")
plt.show()


[Figure: Unrolled swiss roll using LLE; axes z_1, z_2]


• 1st: for each instance, LLE finds its k nearest neighbors and tries to reconstruct the instance as a linear function of those neighbors, choosing the weights that minimize the squared distance between the instance and its reconstruction.
• The weight matrix W now encodes all the local linear relationships between instances.
• 2nd: map the instances into a d-dimensional space while preserving these local relationships as much as possible.
• Scikit-Learn computational complexity:
• finding the k nearest neighbors: O(m x log(m) x n x log(k))
• weight optimization: O(m x n x k^3)
• constructing the low-d representations: O(d x m^2)
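
The first LLE step (reconstruction weights for one instance) reduces to a small linear system. The sketch below is one common way to solve it, given here under assumptions: the helper name and the regularization term `reg` are illustrative choices, not taken from the notes above.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights w (summing to 1) that minimize ||x - sum_j w_j * neighbors[j]||^2.

    Hypothetical helper. `reg` is a small regularizer (an assumption) that keeps
    the local Gram matrix invertible, e.g. when k exceeds the input dimension.
    """
    Z = neighbors - x                       # shift neighbors so x is the origin
    C = Z @ Z.T                             # local Gram matrix, shape (k, k)
    C = C + reg * np.trace(C) * np.eye(len(C))
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                      # enforce the sum-to-one constraint

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                # hypothetical dataset
i, k = 0, 5
dists = np.linalg.norm(X - X[i], axis=1)
nbr_idx = np.argsort(dists)[1:k + 1]        # k nearest neighbors, skipping X[i] itself
w = lle_weights(X[i], X[nbr_idx])

# Sanity check: a point at the centroid of its 3 neighbors gets equal weights.
w_centroid = lle_weights(np.array([0.0, 0.0]),
                         np.array([[1.0, 0.0], [-1.0, 0.5], [0.0, -0.5]]))
print(w, w_centroid)
```

Repeating this for every instance fills the weight matrix W row by row; the k x k solve per instance is the source of the k^3 factor in the weight-optimization cost.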


MDS, Isomap, t-SNE, LDA 


from sklearn.manifold import MDS 
mds = MDS(n_components=2, random_state=42) 
X_reduced_mds = mds.fit_transform(X) 


from sklearn.manifold import Isomap 
isomap = Isomap(n_components=2 ) 
X_reduced_isomap = isomap.fit_transform(X) 


from sklearn.manifold import TSNE 
tsne = TSNE(n_components=2) 
X_reduced_tsne = tsne.fit_transform(X) 


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_mnist = mnist["data"]
y_mnist = mnist["target"]
lda.fit(X_mnist, y_mnist)
X_reduced_lda = lda.transform(X_mnist)


/home/bjpcjp/anaconda3/lib/python3.5/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")


titles = ["MDS", "Isomap", "t-SNE"]

plt.figure(figsize=(11,4))

for subplot, title, X_reduced in zip((131, 132, 133), titles,
                                     (X_reduced_mds, X_reduced_isomap, X_reduced_tsne)):
    plt.subplot(subplot)
    plt.title(title, fontsize=14)
    plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
    plt.xlabel("$z_1$", fontsize=18)
    if subplot == 131:
        plt.ylabel("$z_2$", fontsize=18, rotation=0)
    plt.grid(True)

#save_fig("other_dim_reduction_plot")
plt.show()


[Figure: Swiss roll reduced to 2D. Panels: MDS; Isomap; t-SNE]
Intro


• Graphs are defined in Python and executed in C++.
• Open-sourced in 2015. Runs on Windows, Linux, macOS, iOS, Android.
• Python API: tensorflow.contrib.learn (trains NNs)
• Simpler API: tensorflow.contrib.slim (simplifies building NNs)
• Other high-level APIs: Keras, Pretty Tensor.
• Other deep learning libraries: Caffe, DeepLearning4J, H2O, MXNet, Theano, Torch
• TensorBoard visualization tool
• Cloud service
• Resources: home page, GitHub, StackOverflow


Installation & Test 


$ cd 
$ source env/bin/activate (if using virtualenv) 
$ pip3 install --upgrade tensorflow (or tensorflow-gpu for GPU support) 


$ python3 -c 'import tensorflow; print(tensorflow.__version__)'


!python3 -c 'import tensorflow; print(tensorflow.__version__)'


1.0.0 


import numpy as np 


First Graph 


# create your first graph 
import tensorflow as tf 

x = tf.Variable(3, name="x") 
y = tf.Variable(4, name="y") 
f = x*x*y + y + 2


# run graph by opening a session 


sess = tf.Session() 
sess.run(x.initializer) 
sess.run(y.initializer) 
result = sess.run(f) 
print(result) 
sess.close() 


42 


# for repeated session "runs"
with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()

print(result)


42


# use global_variables_initializer() to set up initialization
init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()  # actually initialize all the variables
    result = f.eval()

print(result)


42

# interactive sessions (from within Jupyter or Python shell)
# interactive sessions are auto-set as default sessions
sess = tf.InteractiveSession()
init.run()
result = f.eval()
print(result)
sess.close()


42 


Managing Graphs 


# any created node = added to default graph
x1 = tf.Variable(1)
x1.graph is tf.get_default_graph()


True


# handling multiple graphs
graph = tf.Graph()
with graph.as_default():
    x2 = tf.Variable(2)

x2.graph is graph, x2.graph is tf.get_default_graph()


(True, False) 


Node Lifecycles 


# TF finds node's dependencies & evaluates them first
w = tf.constant(3)
x = w + 2
y = x + 5
z = x * 3

# previous eval results = NOT reused. the code below evals w & x twice.
with tf.Session() as sess:
    print(y.eval())
    print(z.eval())

# a more efficient evaluation call: y and z are computed in a single graph run
with tf.Session() as sess:
    y_val, z_val = sess.run([y, z])
    print(y_val)
    print(z_val)


10 
15 
10 
15 


Linear Regression with TF 


• TF ops take any number of inputs & produce any number of outputs.
• Constants & variables = source ops (no inputs).
• Inputs & outputs = multidimensional arrays called "tensors"; in the Python API they are represented by NumPy ndarrays. They typically contain floats, but can also carry strings.


Below: Linear Regression on 2D arrays (California Housing 
dataset) 


a = EE 
b = np.zeros((6,4)) 
c = np.ones((6,2)) 


np.c_[a,b,c] 


l, 
l, 
l, 
l, 
l, 
11) 


array([[ 
from sklearn.datasets import fetch_california_housing
import numpy as np

housing = fetch_california_housing()
m, n = housing.data.shape

# add bias feature, x0 = 1
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

print(m, n, housing_data_plus_bias.shape)


20640 8 (20640, 9)


tf.reset_default_graph()

X = tf.constant(
    housing_data_plus_bias,
    dtype=tf.float64, name="X")

print("X shape: ", X.shape)

# housing.target = 1D array. Reshape to col vector to compute theta
# reshape() accepts -1 = "unspecified" for a dimension.
y = tf.constant(
    housing.target.reshape(-1, 1),
    dtype=tf.float64, name="y")

XT = tf.transpose(X)

print("XT shape: ", XT.shape)

# normal equation: theta = (XT · X)^-1 · XT · y
theta = tf.matmul(
    tf.matmul(
        tf.matrix_inverse(
            tf.matmul(XT, X)),
        XT),
    y)

# TF doesn't immediately run the code. It creates nodes that will run with eval().
# TF will auto-run on GPU if available.
with tf.Session() as sess:
    result = theta.eval()

print("theta: \n", result)


X shape: (20640, 9)
XT shape: (9, 20640)
theta:
 [[ -3.69419202e+01]
 [  4.36693293e-01]
 [  9.43577803e-03]
 [ -1.07322041e-01]
 [  6.45065694e-01]
 [ -3.97638942e-06]
 [ -3.78654265e-03]
 [ -4.21314378e-01]
 [ -4.34513755e-01]]
compare to pure NumPy


X = housing_data_plus_bias
y = housing.target.reshape(-1, 1)
theta_numpy = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

print("theta: \n", theta_numpy)


theta:
 [[ -3.69419202e+01]
 [  4.36693293e-01]
 [  9.43577803e-03]
 [ -1.07322041e-01]
 [  6.45065694e-01]
 [ -3.97638942e-06]
 [ -3.78654265e-03]
 [ -4.21314378e-01]
 [ -4.34513755e-01]]
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

lin_reg.fit(
    housing.data,
    housing.target.reshape(-1, 1))

print("theta: \n", np.r_[
    lin_reg.intercept_.reshape(-1, 1),
    lin_reg.coef_.T])


theta:
 [[ -3.69419202e+01]
 [  4.36693293e-01]
 [  9.43577803e-03]
 [ -1.07322041e-01]
 [  6.45065694e-01]
 [ -3.97638942e-06]
 [ -3.78654265e-03]
 [ -4.21314378e-01]
 [ -4.34513755e-01]]
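
The three approaches above agree because they solve the same least-squares problem. The sketch below repeats the comparison on a small synthetic dataset (the data, the "true" coefficients and the variable names are assumptions) and contrasts the explicit inverse with np.linalg.lstsq, which avoids forming the inverse at all.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
features = rng.normal(size=(m, n))                    # hypothetical features
true_theta = np.array([[4.0], [3.0], [-2.0], [0.5]])  # hypothetical coefficients
X = np.c_[np.ones((m, 1)), features]                  # add bias feature x0 = 1
y = X @ true_theta + 0.1 * rng.normal(size=(m, 1))

# Normal equation: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Same least-squares problem, solved without an explicit inverse.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal.ravel())
print(theta_lstsq.ravel())
```

On well-conditioned data both routes give the same answer; the explicit inverse becomes fragile when features are nearly collinear.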
Batch Gradient Descent (instead of Normal Equation):


• Could use TF; let's use Scikit first.


# normalize input features first - otherwise training = much slower.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(
    housing.data)

scaled_housing_data_plus_bias = np.c_[
    np.ones((m, 1)),
    scaled_housing_data]


import pandas as pd
pd.DataFrame(scaled_housing_data_plus_bias).info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
0    20640 non-null float64
1    20640 non-null float64
2    20640 non-null float64
3    20640 non-null float64
4    20640 non-null float64
5    20640 non-null float64
6    20640 non-null float64
7    20640 non-null float64
8    20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB


print("mean (axis=0): \n", scaled_housing_data_plus_bias.mean(axis=0))
print("mean (axis=1): \n", scaled_housing_data_plus_bias.mean(axis=1))
print("mean (w/bias): \n", scaled_housing_data_plus_bias.mean())
print("data shape: \n", scaled_housing_data_plus_bias.shape)


mean (axis=0):
 [  1.00000000e+00   6.60969987e-17   5.50808322e-18   6.60969987e-17
  -1.06030602e-16  -1.10161664e-17   3.44255201e-18  -1.07958431e-15
  -8.52651283e-15]
mean (axis=1):
 [ 0.38915536  0.36424355  0.5116157  ..., -0.06612179 -0.06360587
  0.01359031]
mean (w/bias):
 0.111111111111
data shape:
 (20640, 9)
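
The printed means follow directly from how standardization works. The sketch below reproduces the pattern on made-up data (the array and its scales are assumptions): each scaled column averages to about 0, the bias column averages to 1, so the overall mean of the 9 columns is 1/9, i.e. the 0.111... shown above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 8
# Hypothetical raw features with very different scales and a common offset.
data = rng.normal(size=(m, n)) * np.arange(1, n + 1) + 10.0

# Standardize: subtract each column's mean, divide by its (population) std.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_plus_bias = np.c_[np.ones((m, 1)), scaled]

col_means = scaled_plus_bias.mean(axis=0)   # bias column -> 1, the rest -> ~0
overall_mean = scaled_plus_bias.mean()      # 1 / (n + 1)
print(col_means.round(12), overall_mean)
```

With all features on a comparable scale, a single learning rate works reasonably for every coefficient, which is why Gradient Descent converges much faster.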


Manual gradient computation 


$\theta^{(\text{next})} = \theta - \eta \, \nabla_{\theta}\,\mathrm{MSE}(\theta)$, where $\eta$ is the learning rate.

tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(
    scaled_housing_data_plus_bias,
    dtype=tf.float32, name="X")

y = tf.constant(
    housing.target.reshape(-1, 1),
    dtype=tf.float32, name="y")

# tf.random_uniform = generates uniformly distributed random values
theta = tf.Variable(
    tf.random_uniform([n+1, 1], -1.0, 1.0, seed=42),
    name="theta")

y_pred = tf.matmul(
    X, theta, name="predictions")

error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
gradients = 2/m * tf.matmul(tf.transpose(X), error)

training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:  # do every 100th epoch:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)

    best_theta = theta.eval()

print("Best theta: \n", best_theta)


Epoch 0 MSE = 2.75443
Epoch 100 MSE = 0.632222
Epoch 200 MSE = 0.57278
Epoch 300 MSE = 0.558501
Epoch 400 MSE = 0.549069
Epoch 500 MSE = 0.542288
Epoch 600 MSE = 0.537379
Epoch 700 MSE = 0.533822
Epoch 800 MSE = 0.531243
Epoch 900 MSE = 0.529371
Best theta:
 [[  2.06855226e+00]
 [  7.74078071e-01]
 [  1.31192386e-01]
 [ -1.17845096e-01]
 [  1.64778158e-01]
 [  7.44080753e-04]
 [ -3.91945168e-02]
 [ -8.61356616e-01]
 [ -8.23479712e-01]]
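
The same update rule can be followed step by step without a graph or a session. This NumPy sketch uses synthetic data in place of the housing set (an assumption) and keeps the learning rate and epoch count from the cell above; it then compares the result with the exact least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 500, 3
# Hypothetical scaled features plus a bias column.
X = np.c_[np.ones((m, 1)), rng.normal(size=(m, n))]
y = X @ np.array([[2.0], [0.8], [-0.1], [0.3]]) + 0.05 * rng.normal(size=(m, 1))

n_epochs = 1000
learning_rate = 0.01
theta = rng.uniform(-1.0, 1.0, size=(n + 1, 1))

mse_history = []
for epoch in range(n_epochs):
    error = X @ theta - y
    mse_history.append(float(np.mean(error ** 2)))
    gradients = 2 / m * X.T @ error            # gradient of the MSE w.r.t. theta
    theta = theta - learning_rate * gradients  # one Batch Gradient Descent step

theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta.ravel())
print(theta_exact.ravel())
```

Every epoch uses the full training set to compute the gradient, which is what makes this "batch" Gradient Descent.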
Using autodiff


• TF automatically finds the gradients. Note the different gradients assignment.


tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(
    scaled_housing_data_plus_bias,
    dtype=tf.float32, name="X")

y = tf.constant(
    housing.target.reshape(-1, 1),
    dtype=tf.float32, name="y")

theta = tf.Variable(
    tf.random_uniform([n + 1, 1], -1.0, 1.0,
                      seed=42), name="theta")

y_pred = tf.matmul(
    X, theta, name="predictions")

error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

# AutoDiff to the rescue
# creates list of ops, one/variable, to find gradients per variable
gradients = tf.gradients(mse, [theta])[0]

training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)

    best_theta = theta.eval()

print("Best theta: \n", best_theta)


Epoch 0 MSE = 2.75443
Epoch 100 MSE = 0.632222
Epoch 200 MSE = 0.57278
Epoch 300 MSE = 0.558501
Epoch 400 MSE = 0.549069
Epoch 500 MSE = 0.542288
Epoch 600 MSE = 0.537379
Epoch 700 MSE = 0.533822
Epoch 800 MSE = 0.531243
Epoch 900 MSE = 0.529371
Best theta:
 [[  2.06855249e+00]
 [  7.74078071e-01]
 [  1.31192386e-01]
 [ -1.17845066e-01]
 [  1.64778143e-01]
 [  7.44078017e-04]
 [ -3.91945094e-02]
 [ -8.61356676e-01]
 [ -8.23479772e-01]]
Four ways to run autodiff - see Appendix D


• reverse-mode (the approach TensorFlow uses): best for many inputs, few outputs
• symbolic diff: high accuracy
• forward mode: high accuracy
• numerical diff: low accuracy, but trivial to implement

Using a predefined optimizer (Gradient Descent) 


tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(
    scaled_housing_data_plus_bias,
    dtype=tf.float32, name="X")

y = tf.constant(
    housing.target.reshape(-1, 1),
    dtype=tf.float32, name="y")

theta = tf.Variable(
    tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42),
    name="theta")

y_pred = tf.matmul(
    X, theta, name="predictions")

error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

# the optimizer replaces the explicit gradients and tf.assign() lines
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)

    best_theta = theta.eval()

print("Best theta:\n", best_theta)


Epoch 0 MSE = 2.75443
Epoch 100 MSE = 0.632222
Epoch 200 MSE = 0.57278
Epoch 300 MSE = 0.558501
Epoch 400 MSE = 0.549069
Epoch 500 MSE = 0.542288
Epoch 600 MSE = 0.537379
Epoch 700 MSE = 0.533822
Epoch 800 MSE = 0.531243
Epoch 900 MSE = 0.529371
Best theta:
 [[  2.06855249e+00]
 [  7.74078071e-01]
 [  1.31192386e-01]
 [ -1.17845066e-01]
 [  1.64778143e-01]
 [  7.44078017e-04]
 [ -3.91945094e-02]
 [ -8.61356676e-01]
 [ -8.23479772e-01]]


Using a predefined optimizer (Momentum) 


tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(
    scaled_housing_data_plus_bias,
    dtype=tf.float32, name="X")

y = tf.constant(
    housing.target.reshape(-1, 1),
    dtype=tf.float32, name="y")

theta = tf.Variable(
    tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42),
    name="theta")

y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

optimizer = tf.train.MomentumOptimizer(
    learning_rate=learning_rate,
    momentum=0.25)

training_op = optimizer.minimize(mse)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        sess.run(training_op)

    best_theta = theta.eval()

print("Best theta:\n", best_theta)


Best theta:
 [[  2.06855392e+00]
 [  7.94067979e-01]
 [  1.25333667e-01]
 [ -1.73580602e-01]
 [  2.18767926e-01]
 [   .64708309e-03]
 [ -3.91250364e-02]
 [ -8.85289013e-01]
 [ -8.50607991e-01]]
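
What a momentum optimizer adds to plain Gradient Descent can be sketched in NumPy. The update below is one common formulation (accumulate past gradients in a velocity term, then step along it); treat it as an assumption about the general technique rather than a statement of how tf.train.MomentumOptimizer is implemented. Data and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 500, 3
# Hypothetical scaled features plus a bias column.
X = np.c_[np.ones((m, 1)), rng.normal(size=(m, n))]
y = X @ np.array([[2.0], [0.8], [-0.1], [0.3]]) + 0.05 * rng.normal(size=(m, 1))

def train(momentum, n_epochs=200, learning_rate=0.01):
    """Return the final MSE after training with the given momentum."""
    theta = np.zeros((n + 1, 1))
    velocity = np.zeros_like(theta)
    for _ in range(n_epochs):
        gradients = 2 / m * X.T @ (X @ theta - y)
        velocity = momentum * velocity + gradients   # accumulate past gradients
        theta = theta - learning_rate * velocity
    return float(np.mean((X @ theta - y) ** 2))

mse_plain = train(momentum=0.0)       # reduces to plain Gradient Descent
mse_momentum = train(momentum=0.25)
print(mse_plain, mse_momentum)
```

With momentum = 0 the velocity is just the current gradient, so the loop is identical to the earlier Batch Gradient Descent; a positive momentum lets consistent gradient directions build up speed.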
Training & Data Feeds


• Goal: modify the previous code to implement Mini-batch Gradient Descent.
• Best practice: use placeholder nodes. They perform no computation; they just output the data you feed them at runtime.


A = tf.placeholder(tf.float32, shape=(None, 3))
B = A + 5

with tf.Session() as sess:
    B_val_1 = B.eval(
        feed_dict={A: [[1, 2, 3]]})
    B_val_2 = B.eval(
        feed_dict={A: [[4, 5, 6], [7, 8, 9]]})

print(B_val_1, "\n", B_val_2)


[[ 6.  7.  8.]]
 [[  9.  10.  11.]
 [ 12.  13.  14.]]


# definition phase: change X,y to placeholder nodes
tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=(None, n+1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

theta = tf.Variable(
    tf.random_uniform([n+1, 1], -1.0, 1.0, seed=42),
    name="theta")

y_pred = tf.matmul(
    X, theta, name="predictions")

error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()


# execution phase: fetch minibatches one-by-one.
# use feed_dict to provide values to dependent nodes

import numpy.random as rnd

n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))
print("m: ", m, "\n", "n_batches: ", n_batches, "\n")

def fetch_batch(epoch, batch_index, batch_size):
    rnd.seed(epoch * n_batches + batch_index)
    indices = rnd.randint(m, size=batch_size)
    X_batch = scaled_housing_data_plus_bias[indices]
    y_batch = housing.target.reshape(-1, 1)[indices]
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(
                epoch, batch_index, batch_size)
            sess.run(
                training_op,
                feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()

print("Best theta: \n", best_theta)


m: 20640
n_batches: 207


Best theta:
 [[ 2.07001591]
 [ 0.82045609]
 [ 0.1173173 ]
 [-0.22739051]
 [ 0.31134021]
 [ 0.00353193]
 [-0.01126994]
 [-0.91643935]
 [-0.87950081]]
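
The mini-batch loop itself does not depend on TensorFlow. The NumPy sketch below mirrors the fetch_batch pattern above on synthetic data (the dataset, sizes and names are assumptions): each step estimates the gradient from 100 randomly drawn rows instead of the full training set.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 3
# Hypothetical scaled features plus a bias column.
X = np.c_[np.ones((m, 1)), rng.normal(size=(m, n))]
y = X @ np.array([[2.0], [0.8], [-0.1], [0.3]]) + 0.05 * rng.normal(size=(m, 1))

n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))
learning_rate = 0.01

def fetch_batch(epoch, batch_index, batch_size):
    # Seeding per (epoch, batch) pair makes every batch reproducible.
    batch_rng = np.random.default_rng(epoch * n_batches + batch_index)
    indices = batch_rng.integers(m, size=batch_size)   # sampled with replacement
    return X[indices], y[indices]

theta = np.zeros((n + 1, 1))
for epoch in range(n_epochs):
    for batch_index in range(n_batches):
        X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
        gradients = 2 / batch_size * X_batch.T @ (X_batch @ theta - y_batch)
        theta = theta - learning_rate * gradients

theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta.ravel())
print(theta_exact.ravel())
```

After only 10 passes the estimate is close to, but not exactly at, the least-squares solution: mini-batch gradients are noisy, which is the price for much cheaper steps.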
Model Save/Restore


• Create a Saver node at the end of the construction phase.
• Call its save() method, passing it the session and the checkpoint file path.


tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(
    scaled_housing_data_plus_bias,
    dtype=tf.float32, name="X")

y = tf.constant(
    housing.target.reshape(-1, 1),
    dtype=tf.float32, name="y")

theta = tf.Variable(
    tf.random_uniform([n+1, 1], -1.0, 1.0, seed=42),
    name="theta")

y_pred = tf.matmul(
    X, theta, name="predictions")

error = y_pred - y
mse = tf.reduce_mean(
    tf.square(error),
    name="mse")

optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate)

training_op = optimizer.minimize(mse)
init = tf.global_variables_initializer()
saver = tf.train.Saver()

# can specify which vars to save:
# saver = tf.train.Saver({"weights": theta})

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
            save_path = saver.save(sess, "/tmp/my_model.ckpt")
        sess.run(training_op)

    best_theta = theta.eval()
    save_path = saver.save(sess, "my_model_final.ckpt")

print("Best theta:\n", best_theta)

# model restoration:
# 1) create Saver at end of construction phase
# 2) call saver.restore() at start of execution


Epoch 0 MSE = 2.75443
Epoch 100 MSE = 0.632222
Epoch 200 MSE = 0.57278
Epoch 300 MSE = 0.558501
Epoch 400 MSE = 0.549069
Epoch 500 MSE = 0.542288
Epoch 600 MSE = 0.537379
Epoch 700 MSE = 0.533822
Epoch 800 MSE = 0.531243
Epoch 900 MSE = 0.529371
Best theta:
 [[  2.06855249e+00]
 [  7.74078071e-01]
 [  1.31192386e-01]
 [ -1.17845066e-01]
 [  1.64778143e-01]
 [  7.44078017e-04]
 [ -3.91945094e-02]
 [ -8.61356676e-01]
 [ -8.23479772e-01]]


!cat checkpoint


model_checkpoint_path: "my_model_final.ckpt"
all_model_checkpoint_paths: "/tmp/my_model.ckpt"
all_model_checkpoint_paths: "my_model_final.ckpt"

Visualization - inside Jupyter 


from IPython.display import clear_output, Image, display, HTML


def strip_consts(graph_def, max_const_size=32):
    """Strip large constant values from graph_def."""
    strip_def = tf.GraphDef()
    for n0 in graph_def.node:
        n = strip_def.node.add()
        n.MergeFrom(n0)
        if n.op == 'Const':
            tensor = n.attr['value'].tensor
            size = len(tensor.tensor_content)
            if size > max_const_size:
                tensor.tensor_content = b"<stripped %d bytes>"%size
    return strip_def


def show_graph(graph_def, max_const_size=32):
    """Visualize TensorFlow graph."""
    if hasattr(graph_def, 'as_graph_def'):
        graph_def = graph_def.as_graph_def()
    strip_def = strip_consts(graph_def, max_const_size=max_const_size)
    code = """
        <script>
          function load() {{
            document.getElementById("{id}").pbtxt = {data};
          }}
        </script>
        <link rel="import" href="https://tensorboard.appspot.com/tf-graph-basic.build.html" onload=load()>
        <div style="height:600px">
          <tf-graph-basic id="{id}"></tf-graph-basic>
        </div>
    """.format(data=repr(str(strip_def)), id='graph'+str(np.random.rand()))

    # original was width=1200px, height=620px
    iframe = """
        <iframe seamless style="width:1200px;height:620px;border:0" srcdoc="{}"></iframe>
    """.format(code.replace('"', '&quot;'))
    display(HTML(iframe))


show_graph(tf.get_default_graph())


[Output: an embedded iframe that renders the interactive graph view of the default graph. The serialized graph definition inside it lists the nodes X, y, random_uniform/..., theta, predictions, sub, Square, mse, gradients/..., GradientDescent, init, and save/...; large constants appear as "<stripped N bytes>" markers.]


Visualization - using TensorBoard 


• Start TensorBoard: $ tensorboard --logdir tf_logs/ (serves on localhost:6006)


tf.reset_default_graph()

from datetime import datetime
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")

root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

n_epochs = 1000
learning_rate = 0.01

X = tf.placeholder(
    tf.float32,
    shape=(None, n + 1),
    name="X")

y = tf.placeholder(
    tf.float32,
    shape=(None, 1),
    name="y")

theta = tf.Variable(
    tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42),
    name="theta")

y_pred = tf.matmul(
    X, theta, name="predictions")

error = y_pred - y
mse = tf.reduce_mean(
    tf.square(error),
    name="mse")

optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate)

training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

mse_summary = tf.summary.scalar('MSE', mse)

# FileWriter creates logdir if it is not already present,
# then writes the graph definition to a binary log file (an events file).
summary_writer = tf.summary.FileWriter(
    logdir,
    tf.get_default_graph())

n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(
                epoch,
                batch_index,
                batch_size)

            # Evaluate mse_summary periodically (every 10 mini-batches)
            # and add the result to the events file.
            if batch_index % 10 == 0:
                summary_str = mse_summary.eval(
                    feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                summary_writer.add_summary(
                    summary_str,
                    step)

            sess.run(
                training_op,
                feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()

summary_writer.flush()
summary_writer.close()
print("Best theta:")
print(best_theta)
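The loop above logs a summary every 10 mini-batches and numbers each one with step = epoch * n_batches + batch_index. A minimal NumPy-only sketch of that bookkeeping follows. It assumes synthetic data and a hypothetical fetch_batch that samples random rows; the fetch_batch used in these notes is defined elsewhere and may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 1000, 3                                   # synthetic sizes (assumption)
X_data = np.c_[np.ones((m, 1)), rng.normal(size=(m, n))]  # bias column + features
true_theta = np.array([[1.0], [2.0], [-3.0], [0.5]])      # made-up target weights
y_data = X_data @ true_theta

n_epochs, batch_size, learning_rate = 10, 100, 0.01
n_batches = int(np.ceil(m / batch_size))

def fetch_batch(epoch, batch_index, batch_size):
    """Hypothetical helper: draw a reproducible random mini-batch."""
    batch_rng = np.random.default_rng(epoch * n_batches + batch_index)
    idx = batch_rng.integers(0, m, size=batch_size)
    return X_data[idx], y_data[idx]

theta = rng.uniform(-1.0, 1.0, size=(n + 1, 1))
logged = []                                      # (step, mse) pairs, standing in for the events file

for epoch in range(n_epochs):
    for batch_index in range(n_batches):
        X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
        error = X_batch @ theta - y_batch
        if batch_index % 10 == 0:                # log every 10 mini-batches
            step = epoch * n_batches + batch_index
            logged.append((step, float(np.mean(error ** 2))))
        gradients = 2 / batch_size * X_batch.T @ error
        theta -= learning_rate * gradients       # Gradient Descent step
```

With 1000 instances and a batch size of 100 there are 10 batches per epoch, so only batch 0 of each epoch is logged and the recorded steps are 0, 10, ..., 90.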


Best theta:
[[ 2.07001591]
 [ 0.82045609]
 [ 0.1173173 ]
 [ 0.22739051]
 [ 0.31134021]
 [ 0.00353193]
 [ 0.01126994]
 [ 0.91643935]
 [ 0.87950081]]


Name Scopes 


• Graphs can contain thousands of nodes. Name scopes group related nodes to aid visualization.


tf.reset_default_graph()

now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

n_epochs = 1000
learning_rate = 0.01

X = tf.placeholder(
    tf.float32,
    shape=(None, n + 1),
    name="X")

y = tf.placeholder(
    tf.float32,
    shape=(None, 1),
    name="y")

theta = tf.Variable(
    tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42),
    name="theta")

y_pred = tf.matmul(
    X, theta,
    name="predictions")

## Name Scope: the nodes created inside are grouped under "loss"
with tf.name_scope('loss') as scope:
    error = y_pred - y
    mse = tf.reduce_mean(tf.square(error), name="mse")

optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate)

training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

mse_summary = tf.summary.scalar(
    'MSE', mse)

summary_writer = tf.summary.FileWriter(
    logdir, tf.get_default_graph())

n_epochs = 10
batch_size = 100
n_batches = int(np.ceil(m / batch_size))

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(
                epoch,
                batch_index,
                batch_size)

            if batch_index % 10 == 0:
                summary_str = mse_summary.eval(
                    feed_dict={X: X_batch, y: y_batch})
                step = epoch * n_batches + batch_index
                summary_writer.add_summary(
                    summary_str,
                    step)

            sess.run(
                training_op,
                feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()

summary_writer.flush()
summary_writer.close()
print("Best theta:")
print(best_theta)


Best theta:
[[ 2.07001591]
 [ 0.82045609]
 [ 0.1173173 ]
 [ 0.22739051]
 [ 0.31134021]
 [ 0.00353193]
 [ 0.01126994]
 [ 0.91643935]
 [ 0.87950081]]


In TensorBoard: 


[Figure: TensorBoard graph view; visible node labels include theta, gradients, Gradient..., predictions, and y.]


Modularity 


• Example: create a graph that adds the outputs of two ReLU nodes.
• Each ReLU outputs its linear result if that result is > 0, and 0 otherwise.
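As a framework-free reference for what such a graph computes, the sketch below evaluates two ReLU units and sums them in NumPy. The inputs and weights are made-up values chosen for illustration, not taken from these notes.

```python
import numpy as np

def relu_unit(X, w, b):
    """One ReLU node: max(X @ w + b, 0), applied element-wise."""
    return np.maximum(X @ w + b, 0)

X = np.array([[1.0, 2.0, 3.0],
              [-1.0, 0.0, 1.0]])         # 2 instances, 3 features
w1 = np.array([[1.0], [0.0], [0.0]])     # illustrative weights
w2 = np.array([[0.5], [0.5], [0.5]])

relu1 = relu_unit(X, w1, 0.0)            # [[1.], [0.]]
relu2 = relu_unit(X, w2, 0.0)            # [[3.], [0.]]
output = relu1 + relu2                   # plays the role of tf.add_n([relu1, relu2])
```

The second instance gives a negative or zero linear result in both units, so its output is clipped to 0.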


# UGLY

tf.reset_default_graph()

n_features = 3

X = tf.placeholder(
    tf.float32,
    shape=(None, n_features),
    name="X")

w1 = tf.Variable(
    tf.random_normal(
        (n_features, 1)),
    name="weights1")

w2 = tf.Variable(
    tf.random_normal(
        (n_features, 1)),
    name="weights2")

b1 = tf.Variable(
    0.0, name="bias1")
b2 = tf.Variable(
    0.0, name="bias2")

linear1 = tf.add(
    tf.matmul(X, w1), b1, name="linear1")

linear2 = tf.add(
    tf.matmul(X, w2), b2, name="linear2")

relu1 = tf.maximum(
    linear1, 0, name="relu1")

relu2 = tf.maximum(
    linear2, 0, name="relu2")

output = tf.add_n([relu1, relu2],
                  name="output")


# better -- you can create functions that build ReLUs!

tf.reset_default_graph()

def relu(X):
    # one weight per input feature, one output unit
    w_shape = (int(X.get_shape()[1]), 1)

    w = tf.Variable(
        tf.random_normal(w_shape),
        name="weights")

    b = tf.Variable(
        0.0,
        name="bias")

    linear = tf.add(
        tf.matmul(X, w),
        b,
        name="linear")

    return tf.maximum(linear, 0, name="relu")

n_features = 3

X = tf.placeholder(
    tf.float32,
    shape=(None, n_features),
    name="X")

relus = [relu(X) for i in range(5)]

output = tf.add_n(
    relus,
    name="output")

summary_writer = tf.summary.FileWriter(
    "logs/relu1",
    tf.get_default_graph())


Sharing Variables 


• Simplest option: define the shared variable first, then pass it as a parameter to every function that needs it.


# better, with name scopes

tf.reset_default_graph()

def relu(X):
    with tf.name_scope("relu"):
        w_shape = (int(X.get_shape()[1]), 1)

        w = tf.Variable(
            tf.random_normal(w_shape), name="weights")

        b = tf.Variable(
            0.0, name="bias")

        linear = tf.add(
            tf.matmul(X, w), b, name="linear")

        return tf.maximum(
            linear, 0, name="max")

n_features = 3

X = tf.placeholder(
    tf.float32,
    shape=(None, n_features),
    name="X")

relus = [relu(X) for i in range(5)]

output = tf.add_n(
    relus, name="output")

summary_writer = tf.summary.FileWriter(
    "logs/relu2",
    tf.get_default_graph())

summary_writer.close()


!ls logs

relu1 relu2 relu6


tf.reset_default_graph()

# Option 1: create the shared variable first and pass it to each relu() call.
def relu(X, threshold):
    with tf.name_scope("relu"):
        w_shape = (int(X.get_shape()[1]), 1)
        w = tf.Variable(tf.random_normal(w_shape), name="weights")
        b = tf.Variable(0.0, name="bias")
        linear = tf.add(tf.matmul(X, w), b, name="linear")
        return tf.maximum(linear, threshold, name="max")

threshold = tf.Variable(0.0, name="threshold")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")

relus = [relu(X, threshold) for i in range(5)]

output = tf.add_n(relus, name="output")




tf.reset_default_graph()

# Option 2: store the shared variable as an attribute of the relu() function,
# created on the first call only.
def relu(X):
    with tf.name_scope("relu"):
        if not hasattr(relu, "threshold"):
            relu.threshold = tf.Variable(0.0, name="threshold")
        w_shape = (int(X.get_shape()[1]), 1)
        w = tf.Variable(tf.random_normal(w_shape), name="weights")
        b = tf.Variable(0.0, name="bias")
        linear = tf.add(tf.matmul(X, w), b, name="linear")
        return tf.maximum(linear, relu.threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")

relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")




tf.reset_default_graph()

# Option 3: use get_variable() inside a variable scope. With reuse=True,
# relu() fetches the existing "relu/threshold" variable instead of creating one.
def relu(X):
    with tf.variable_scope("relu", reuse=True):
        threshold = tf.get_variable("threshold", shape=(),
                                    initializer=tf.constant_initializer(0.0))
        w_shape = (int(X.get_shape()[1]), 1)
        w = tf.Variable(tf.random_normal(w_shape), name="weights")
        b = tf.Variable(0.0, name="bias")
        linear = tf.add(tf.matmul(X, w), b, name="linear")
        return tf.maximum(linear, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")

# Create the shared variable once, before any relu() call reuses it.
with tf.variable_scope("relu"):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")

summary_writer = tf.summary.FileWriter("logs/relu6", tf.get_default_graph())
summary_writer.close()




Intro - Perceptrons 


• Simplest ANN architecture.

• Uses the linear threshold unit (LTU): it computes a weighted sum of its inputs, applies a step function to that sum, and outputs the result.

• A single LTU can be used for simple linear binary classification.

• Perceptron = a single layer of LTUs, each one connected to all inputs.

• Perceptron training is based on Hebb's rule (basically, the connection weight between two neurons goes up when they have the same output).

• The decision boundary is linear, so Perceptrons cannot learn complex patterns.
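A minimal NumPy sketch of an LTU and the Perceptron update rule follows. It assumes a Heaviside step function, a learning rate eta of 1.0, zero-initialized weights, and a toy linearly separable dataset (the logical AND); none of these specifics come from these notes.

```python
import numpy as np

def ltu_predict(X, w, b):
    """Linear threshold unit: weighted sum of inputs, then a step function."""
    return (X @ w + b >= 0).astype(int)

def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Perceptron rule: w += eta * (target - prediction) * x, one instance at a time."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - ltu_predict(x_i, w, b)   # -1, 0, or +1
            w += eta * error * x_i                 # only wrong predictions change the weights
            b += eta * error
    return w, b

# Toy data: logical AND, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = ltu_predict(X, w, b)
```

Weights change only on misclassified instances, so once every instance is classified correctly the updates stop. The same sketch would never settle on XOR, which is not linearly separable.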


import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()

X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris-Setosa, else 0

per_clf = Perceptron(random_state=12)
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])
print(y_pred)


[1] 


• The Perceptron learning algorithm is very similar to SGD.
• Perceptrons do not provide class probabilities (unlike a Logistic Regression classifier). They simply make predictions based on a hard threshold.
• Some limitations can be eliminated by stacking Perceptrons.


import matplotlib.pyplot as plt

# slope and offset of the decision boundary line
a = -per_clf.coef_[0][0] / per_clf.coef_[0][1]
b = -per_clf.intercept_ / per_clf.coef_[0][1]

axes = [0, 5, 0, 2]

x0, x1 = np.meshgrid(
    np.linspace(axes[0], axes[1], 500).reshape(-1, 1),
    np.linspace(axes[2], axes[3], 200).reshape(-1, 1),
)

X_new = np.c_[
    x0.ravel(),
    x1.ravel()]

y_predict = per_clf.predict(X_new)

zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs", label="Not Iris-Setosa")
plt.plot(X[y==1, 0], X[y==1, 1], "yo", label="Iris-Setosa")

plt.plot(
    [axes[0],
     axes[1]],
    [a * axes[0] + b,
     a * axes[1] + b],
    "k-", linewidth=3)

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#9898ff', '#fafab0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap, linewidth=5)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)

plt.legend(loc="lower right", fontsize=14)
plt.axis(axes)

#save_fig("perceptron_iris_plot")
plt.show()
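The slope a and offset b used to draw the boundary line in the plotting code above follow from setting the Perceptron's decision function to zero. The symbols here are introduced for illustration only: w_0 and w_1 stand for the two entries of per_clf.coef_[0], w_b for per_clf.intercept_, x_0 for petal length, and x_1 for petal width.

```latex
\begin{aligned}
w_0 x_0 + w_1 x_1 + w_b &= 0 \\
x_1 &= \underbrace{-\frac{w_0}{w_1}}_{a}\, x_0 \;+\; \underbrace{\left(-\frac{w_b}{w_1}\right)}_{b}
\end{aligned}
```

Points on one side of this line get a positive score and are predicted as class 1; points on the other side are predicted as class 0.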


[Figure: Perceptron decision boundary on the iris data. x-axis: Petal length (0 to 5); y-axis: Petal width; legend: Not Iris-Setosa, Iris-Setosa.]





MLPs and Backpropagation 


• An MLP contains one input layer, at least one hidden layer (LTU based), and one output layer (LTU based).
• Backpropagation was introduced in a 1986 paper. It can be described as Gradient Descent using reverse-mode autodiff.
• For each training instance:
  ◦ Compute the output of each node in each consecutive layer (forward pass).
  ◦ Measure the output error and how much each node in the last hidden layer contributed to it.
  ◦ Measure how much each node in the previous hidden layer contributed to the error of this hidden layer.
  ◦ Repeat until the input layer is reached (backward pass).
  ◦ Adjust each connection weight to reduce the error.
• To make the algorithm work, the MLP architecture was changed to use the logistic function σ(z) = 1 / (1 + exp(-z)) instead of the step function. The step function is flat everywhere, so it gives Gradient Descent no gradient to work with.
• Backpropagation can also be used with the hyperbolic tangent or ReLU activation functions.
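The per-instance steps above can be sketched in NumPy for a tiny MLP with one hidden layer and logistic activations. The layer sizes, random weights, training instance, squared-error loss, and learning rate are all illustrative assumptions, and for brevity only the connection weights (not the biases) are updated.

```python
import numpy as np

def sigma(z):
    """Logistic activation."""
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])                       # one training instance (made up)
target = np.array([1.0])
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output

def loss(W1, b1, W2, b2):
    """Squared error of the network output for the single instance."""
    hidden = sigma(x @ W1 + b1)
    output = sigma(hidden @ W2 + b2)
    return 0.5 * np.sum((output - target) ** 2)

# Forward pass: compute each layer's output in turn.
hidden = sigma(x @ W1 + b1)
output = sigma(hidden @ W2 + b2)

# Backward pass: error signal at the output layer, then at the hidden layer.
# sigma'(z) = sigma(z) * (1 - sigma(z)) supplies the activation derivative.
delta_out = (output - target) * output * (1 - output)
delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

# Gradient of the loss with respect to each connection weight.
grad_W2 = np.outer(hidden, delta_out)
grad_W1 = np.outer(x, delta_hidden)

# Gradient Descent step: adjust the weights to reduce the error.
eta = 0.5
loss_before = loss(W1, b1, W2, b2)
W1_new, W2_new = W1 - eta * grad_W1, W2 - eta * grad_W2
loss_after = loss(W1_new, b1, W2_new, b2)
```

A quick way to gain confidence in hand-written gradients like these is to compare one entry against a finite-difference estimate of the loss.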
# Activation functions

def logit(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def derivative(f, z, eps=0.000001):
    # central-difference approximation of f'(z)
    return (f(z + eps) - f(z - eps))/(2 * eps)

z = np.linspace(-5, 5, 200)

plt.figure(figsize=(11,4))

plt.subplot(121)
plt.plot(z, np.sign(z), "r-", linewidth=2, label="Step")
plt.plot(z, logit(z), "g--", linewidth=2, label="Logit")
plt.plot(z, np.tanh(z), "b-", linewidth=2, label="Tanh")
plt.plot(z, relu(z), "m-.", linewidth=2, label="ReLU")
plt.grid(True)
plt.legend(loc="center right", fontsize=14)
plt.title("Activation functions", fontsize=14)
plt.axis([-5, 5, -1.2, 1.2])

plt.subplot(122)
plt.plot(z, derivative(np.sign, z), "r-", linewidth=2, label="Step")
plt.plot(0, 0, "ro", markersize=5)
plt.plot(0, 0, "rx", markersize=10)
plt.plot(z, derivative(logit, z), "g--", linewidth=2, label="Logit")
plt.plot(z, derivative(np.tanh, z), "b-", linewidth=2, label="Tanh")
plt.plot(z, derivative(relu, z), "m-.", linewidth=2, label="ReLU")
plt.grid(True)
#plt.legend(loc="center right", fontsize=14)
plt.title("Derivatives", fontsize=14)
plt.axis([-5, 5, -0.2, 1.2])

#save_fig("activation_functions_plot")
plt.show()


[Figure: two panels, "Activation functions" and "Derivatives"]





# activation functions, continued
def heaviside(z):
    return (z >= 0).astype(z.dtype)

def sigmoid(z):
    return 1/(1+np.exp(-z))

def mlp_xor(x1, x2, activation=heaviside):
    return activation(
        -activation(x1 + x2 - 1.5) + activation(x1 + x2 - 0.5) - 0.5)

x1s = np.linspace(-0.2, 1.2, 100)
x2s = np.linspace(-0.2, 1.2, 100)
x1, x2 = np.meshgrid(x1s, x2s)

z1 = mlp_xor(x1, x2, activation=heaviside)
z2 = mlp_xor(x1, x2, activation=sigmoid)

plt.figure(figsize=(10,4))

plt.subplot(121)
plt.contourf(x1, x2, z1)
plt.plot([0, 1], [0, 1], "gs", markersize=20)
plt.plot([0, 1], [1, 0], "y^", markersize=20)
plt.title("Activation function: heaviside", fontsize=14)
plt.grid(True)

plt.subplot(122)
plt.contourf(x1, x2, z2)
plt.plot([0, 1], [0, 1], "gs", markersize=20)
plt.plot([0, 1], [1, 0], "y^", markersize=20)
plt.title("Activation function: sigmoid", fontsize=14)
plt.grid(True)

plt.show()


[Figure: two contour plots, "Activation function: heaviside" and "Activation function: sigmoid"]


MLP Training


• MLPs are often used for classification - each output corresponds to a distinct
  binary class (ex: urgent/not-urgent, spam/not-spam, ...)
• If the classes are exclusive, the output layer often uses a shared softmax function.
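As a rough illustration of the shared softmax mentioned above, the sketch below turns a row of raw output scores (logits) into class probabilities that sum to 1. The function name `softmax` and the sample values are assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax. Subtracting the row max first avoids overflow in exp()."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

# two instances, three exclusive classes (made-up scores)
logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])
probs = softmax(logits)
predicted_class = np.argmax(probs, axis=1)
print(probs)
print(predicted_class)
```

Because the probabilities in each row sum to 1, raising one class's score necessarily lowers the others, which is why softmax suits exclusive classes only.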


DNN Training with "plain" TF 


• Use mini-batch gradient descent on the MNIST dataset
• Specify #inputs, #outputs, #hidden neurons in each layer


import tensorflow as tf

tf.reset_default_graph()
n_inputs = 28*28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

learning_rate = 0.01

# placeholders for training data & targets
# X, y only partially defined due to unknown #instances in training batches
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

# now need to create two hidden layers + one output layer
# (no need to define your own; TF shortcut: fully_connected())

def neuron_layer(X, n_neurons, name, activation=None):
    # define a name scope to aid readability
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])

        # create weights matrix. 2D (#inputs, #neurons)
        # randomly initialized w/ truncated Gaussian, stdev = 1/sqrt(#inputs)
        # aids convergence speed
        stddev = 1 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="weights")

        # create bias variable, initialized to zero, one param per neuron
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")

        # Z = X dot W + b
        Z = tf.matmul(X, W) + b

        # return relu(Z), or simply Z
        if activation == "relu":
            return tf.nn.relu(Z)
        else:
            return Z

with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation="relu")
    hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation="relu")

    # logits = NN output before going thru softmax activation
    logits = neuron_layer(hidden2, n_outputs, "output")

with tf.name_scope("loss"):
    # sparse_softmax_cross_entropy_with_logits() -- TF routine, handles corner cases for you.
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y,
        logits=logits)

    # use reduce_mean() to find mean cross-entropy over all instances.
    loss = tf.reduce_mean(
        xentropy,
        name="loss")

# use GD to handle cost function, ie minimize loss
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

# use accuracy as performance measure.
with tf.name_scope("eval"):
    # verify whether highest logit corresponds to target class
    # using in_top_k(), returns 1D tensor of booleans
    correct = tf.nn.in_top_k(logits, y, 1)

    # recast booleans to float & find avg. this gives overall accuracy number.
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()  # initializer node
saver = tf.train.Saver()                  # to save trained parameters


Execution Phase 


• Load MNIST using TF helpers (fetch, auto-scale, shuffle, provide minibatch
  function)


# load MNIST
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

n_epochs = 20
batch_size = 50


Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


# Train
with tf.Session() as sess:
    init.run()  # initialize all variables

    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            # use next_batch() to fetch data
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(
                training_op,
                feed_dict={X: X_batch, y: y_batch})

        acc_train = accuracy.eval(
            feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(
            feed_dict={X: mnist.test.images,
                       y: mnist.test.labels})

        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_model_final.ckpt")


[Output: per-epoch train and test accuracy for epochs 0-19. Test accuracy by epoch: 0.8753, 0.9087, 0.9212, 0.9239, 0.9335, 0.9365, 0.9415, 0.9449, 0.9459, 0.9497, 0.9539, 0.9563, 0.958, 0.9584, 0.9614, 0.9622, 0.9639, 0.9635, 0.9654, 0.9667]


Using in production


• Now trained - you can use the NN to predict.


with tf.Session() as sess:
    # load from disk
    saver.restore(sess, save_path)  # "my_model_final.ckpt"

    # grab the first 20 test images
    X_new_scaled = mnist.test.images[:20]
    Z = logits.eval(feed_dict={X: X_new_scaled})

    print(np.argmax(Z, axis=1))
    print(mnist.test.labels[:20])


[7 2 1 0 4 1 4 9 6 9 0 6 9 0 1 5 9 7 3 4]
[7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]


Parameter Tuning 


• Way too many parameters - Grid Search approach is not time-effective.
• 1st option: randomized search.
• 2nd option: Oscar
• Start with common defaults to restrict the search space.


Number of hidden layers


• Deep nets have much better parameter efficiency than shallow ones. (They
  can model complex functions with far fewer neurons.)
• Largely due to the hierarchical nature of most data modeling problems


Number of neurons per hidden layer


• Input & output layer sizes are determined by the task. Ex: MNIST requires
  28x28 inputs, 10 outputs
• Try increasing # of layers before # neurons/layer.
• Simple trick: pick a model w/ excessive layers & neurons, then use early stopping,
  regularization, dropout, etc. to prevent overfit.


Activation functions


• Defaults:
  ◦ use ReLU in hidden layers. Faster & helps avoid GD getting stuck on
    local plateaus.
  ◦ use Softmax in output layer (for classification; none needed for
    regression.)


Vanishing & Exploding Gradients 


• Gradients get smaller as the algorithm progresses to lower layers. Eventually GD
  leaves the lower weights virtually unchanged, so training never converges.
• Gradients can also grow out of control (often seen in RNNs).
• Significant paper - using a combo of logistic sigmoid activation with random
  weight initialization (normal, mean=0, stdev=1), each layer's output variance
  was >> its input variance.
• Logistic activation: function saturates at 0 or 1 with derivative very close to 0
  ==> so backpropagation has no gradient to use.
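The saturation effect described above can be checked numerically. The sketch below is an illustration only (the helper names and sample inputs are assumptions): it evaluates the logistic derivative at a few points and shows how multiplying even the largest possible derivative across ten layers shrinks the gradient. Weights are ignored here to keep the arithmetic visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the logistic function: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([0.0, 2.0, 5.0, 10.0])
grads = sigmoid_grad(z)
print(grads)   # largest at z = 0, nearly zero once the function saturates

# chain rule through 10 sigmoid layers, using the maximum derivative (0.25) at each
grad_after_10_layers = sigmoid_grad(0.0) ** 10
print(grad_after_10_layers)
```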


[Figure: Sigmoid activation function, with the two saturating regions and the central linear region labeled]
















Xavier & He Initialization 


• For signals to flow properly in both directions, each layer's output variance
  should equal its input variance.
• Recommends initializing connection weights randomly, scaled using #ins and #outs
  ($n_{inputs}$, $n_{outputs}$):


Activation function      | Uniform distribution [-r, r]                           | Normal distribution (mean 0)
Logistic                 | $r = \sqrt{\frac{6}{n_{inputs} + n_{outputs}}}$          | $\sigma = \sqrt{\frac{2}{n_{inputs} + n_{outputs}}}$
Hyperbolic tangent       | $r = 4\sqrt{\frac{6}{n_{inputs} + n_{outputs}}}$         | $\sigma = 4\sqrt{\frac{2}{n_{inputs} + n_{outputs}}}$
ReLU (and its variants)  | $r = \sqrt{2}\sqrt{\frac{6}{n_{inputs} + n_{outputs}}}$  | $\sigma = \sqrt{2}\sqrt{\frac{2}{n_{inputs} + n_{outputs}}}$


• Default: fully_connected() function uses Xavier initialization w/ uniform
  distribution. Change to He initialization by using the variance_scaling_initializer()
  function.
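To make the scaling concrete, here is a small NumPy sketch that computes the initialization scale for each activation function: sqrt(2 / (n_inputs + n_outputs)) for a normal distribution or sqrt(6 / (n_inputs + n_outputs)) for a uniform one, multiplied by 1 (logistic), 4 (tanh) or sqrt(2) (ReLU). The function name `init_scale` and its arguments are hypothetical, chosen only for this example; this is not how TensorFlow implements its initializers.

```python
import numpy as np

def init_scale(n_inputs, n_outputs, activation="logistic", distribution="normal"):
    """Return sigma (normal distribution) or r (uniform distribution on [-r, r])."""
    factor = {"logistic": 1.0, "tanh": 4.0, "relu": np.sqrt(2.0)}[activation]
    numerator = 2.0 if distribution == "normal" else 6.0
    return factor * np.sqrt(numerator / (n_inputs + n_outputs))

rng = np.random.default_rng(42)   # hypothetical seed

# He-style initialization (ReLU row) for a 784 -> 300 layer
sigma = init_scale(784, 300, activation="relu", distribution="normal")
W = rng.normal(0.0, sigma, size=(784, 300))

# Xavier-style initialization (logistic row), uniform variant
r = init_scale(784, 300, activation="logistic", distribution="uniform")
W_uniform = rng.uniform(-r, r, size=(784, 300))
print(sigma, r)
```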


import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

n_inputs = 28*28
n_hidden1 = 300

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = fully_connected(X, n_hidden1, weights_initializer=he_init, scope="h1")


Non-Saturating Activation Functions 


• ReLU activations suffer from the dying ReLU problem (neurons stop emitting
  anything other than zero).
• Workaround: the leaky ReLU. Alpha defines the leakage; typically set to 0.01.


[Figure: Leaky ReLU activation function, with the small negative-side slope labeled "Leak"]


• Also: randomized leaky ReLU (RReLU) (randomized alpha)
• Also: parametric leaky ReLU (PReLU) (alpha can be modified during
  backprop)
• Also: exponential linear unit (ELU). Allows negative values when z<0; non-
  zero gradient for z<0 (avoids dying units issue); smooth function everywhere.
  Uses the exponential function, so harder to compute. (paper)


[Figure: ELU activation function (α = 1)]


# TF doesn't have leaky ReLU predefined, but easy to build.

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu)
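For comparison outside TensorFlow, the leaky ReLU and ELU can be sketched in plain NumPy. The function names and default alphas below (0.01 for the leak, 1.0 for ELU) follow the values mentioned in these notes; everything else is an assumption made for the example.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # small slope alpha for z < 0, identity for z >= 0
    return np.maximum(alpha * z, z)

def elu(z, alpha=1.0):
    # alpha * (exp(z) - 1) for z < 0, identity otherwise;
    # np.minimum keeps exp() from overflowing on the unused positive branch
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

z = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(z))
print(elu(z))
```

Both functions return non-zero outputs (and have non-zero gradients) for negative inputs, which is what keeps units from dying.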


Batch Normalization 


• Proposed to solve vanishing/exploding gradients.
• Idea: prior to the activation function, 1) zero-center & normalize inputs, 2) scale &
  shift the result with 2 new params per layer.
• Net effect: model learns the optimal scale & mean of inputs for each layer.
• Algorithm (over a mini-batch B of $m_B$ instances):

  1. $\mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} x^{(i)}$
  2. $\sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} \left(x^{(i)} - \mu_B\right)^2$
  3. $\hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
  4. $z^{(i)} = \gamma \hat{x}^{(i)} + \beta$

• Does add computational complexity. Consider plain ELU + He initialization as
  well.
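The training-time computation (batch mean, batch variance, normalize, then scale and shift) can be sketched in NumPy as below. The function name `batch_norm_forward`, the epsilon value, and the sample data are assumptions made for the example; the running averages used at test time are left out.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """Training-mode batch normalization over a mini-batch X of shape (m_B, n_features)."""
    mu = X.mean(axis=0)                      # per-feature mean over the batch
    var = X.var(axis=0)                      # per-feature variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)    # zero-center & normalize
    return gamma * X_hat + beta              # scale & shift with the 2 learned params

rng = np.random.default_rng(0)               # hypothetical seed
X = rng.normal(5.0, 3.0, size=(64, 4))       # badly scaled inputs: mean 5, stdev 3
Z = batch_norm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(Z.mean(axis=0), Z.std(axis=0))
```

With gamma = 1 and beta = 0 the output is simply standardized; during training those two parameters are learned, so the layer can recover whatever scale and mean work best.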


Batch Normalization with TF 


• batch_normalization() - centers & normalizes inputs
• batch_norm() - above, plus finds mean, stdev, scaling, offset params
• call directly or include it in fully_connected() arguments
• MNIST dataset again


from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")


Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes 


Extracting /tmp/data/train-images-idx3-ubyte.gz 

Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. 
Extracting /tmp/data/train-labels-idx1-ubyte.gz 

Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. 
Extracting /tmp/data/t10k-images-idx3-ubyte.gz 

Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. 
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz 


# setup
import tensorflow as tf
from tensorflow.contrib.layers import batch_norm, fully_connected

tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

# is_training: tells batch_norm() whether to use current minibatch's mean & stdev
# (found during training) or use running avgs (during testing)

with tf.name_scope("dnn"):
    hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, activation_fn=leaky_relu, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, activation_fn=None, scope="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(len(mnist.test.labels)//batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "my_model_final.ckpt")


[Output: per-epoch train and test accuracy for epochs 0-19. Test accuracy by epoch: 0.642, 0.7824, 0.827, 0.8539, 0.8686, 0.8759, 0.8843, 0.8903, 0.8969, 0.9018, 0.9014, 0.9065, 0.9078, (two values unreadable), 0.9123, 0.9141, 0.9149, 0.9159, 0.9174]


Gradient Clipping


• Used to limit the exploding gradients problem: clip the gradients during backprop
  so they never exceed a threshold.
• Typical use case: recurrent NNs.
• source
• TF's minimize() function both computes and applies the gradients. To clip, call
  compute_gradients() first, clip the results, then call apply_gradients().



threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(
    learning_rate)

grads_and_vars = optimizer.compute_gradients(
    loss)

capped_gvs = [
    (tf.clip_by_value(
        grad, -threshold, threshold), var)
    for grad, var in grads_and_vars]

training_op = optimizer.apply_gradients(capped_gvs)
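What clipping does to a gradient vector can be seen in a few lines of NumPy. The first helper mirrors the element-wise clipping used above; the second, clipping by norm, is an alternative that these notes do not cover and is included only for contrast. Function names and sample values are assumptions for the example.

```python
import numpy as np

def clip_by_value(grad, threshold):
    # clip every component into [-threshold, threshold]
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm):
    # rescale the whole vector if its L2 norm exceeds max_norm (direction preserved)
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

grad = np.array([0.5, -3.0, 4.0])
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)
print(by_value)
print(by_norm, np.linalg.norm(by_norm))
```

Note the trade-off: clipping by value can change the direction of the gradient, while clipping by norm keeps the direction and only shortens the step.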


Pretrained Layers & Reuse 


• Best practice: look for an existing NN that tackles a similar task, then reuse its lower
  layers (aka transfer learning).


# Reuse with TF


Reusing Models from Other Frameworks


• Requires manual loading of weights (ex: Theano)
• Very tedious


original_w = []  # Load the weights from the other framework
original_b = []  # Load the biases from the other framework

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = fully_connected(X, n_hidden1, scope="hidden1")

[...]  # Build the rest of the model

# Get a handle on the variables created by fully_connected()
with tf.variable_scope("", default_name="", reuse=True):  # root scope
    hidden1_weights = tf.get_variable("hidden1/weights")
    hidden1_biases = tf.get_variable("hidden1/biases")

# Create nodes to assign arbitrary values to the weights and biases
original_weights = tf.placeholder(tf.float32, shape=(n_inputs, n_hidden1))
original_biases = tf.placeholder(tf.float32, shape=(n_hidden1))

assign_hidden1_weights = tf.assign(hidden1_weights, original_weights)
assign_hidden1_biases = tf.assign(hidden1_biases, original_biases)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    sess.run(
        assign_hidden1_weights,
        feed_dict={original_weights: original_w})
    sess.run(
        assign_hidden1_biases,
        feed_dict={original_biases: original_b})

    [...]  # Train the model on your new task


Freezing Lower Layers 


• If the 1st DNN already learned low-level features, try to reuse them by freezing
  the weights.
• Simplest way:


# provide all trainable vars in hidden layers 3,4 & outputs to the optimizer function
# (this omits vars in hidden layers 1,2)
train_vars = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES,
    scope="hidden[34]|outputs")

# minimizer can't touch layers 1,2 - they're "frozen"
training_op = optimizer.minimize(
    loss,
    var_list=train_vars)


Caching Lower Layers 


• Frozen layers never change, so the output of the topmost frozen layer can be
  computed once per training instance and cached. Huge speed boost!


import numpy as np

n_epochs = 100
n_batches = 500

# hidden2_outputs: cached outputs of the topmost frozen layer for the training set
for epoch in range(n_epochs):
    shuffled_idx = np.random.permutation(
        len(hidden2_outputs))

    hidden2_batches = np.array_split(
        hidden2_outputs[shuffled_idx],
        n_batches)

    y_batches = np.array_split(
        y_train[shuffled_idx],
        n_batches)

    for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
        sess.run(
            training_op,
            feed_dict={hidden2: hidden2_batch, y: y_batch})


Tweaking/Dropping/Replacing Upper Layers 


• Original output layer: should be replaced (little chance of reuse)
• Iterative freeze/train/compare process to see how many upper layers are needed


Model Zoos


• When you want to find a net already trained on a similar task
• TensorFlow Model Zoo
• Caffe Model Zoo - converter on github


Unsupervised pre-training


• Tough problem, but doable.
• Train layers one-by-one, starting with the lowest layer
• Freeze completed layers & train the next layer on previous results


Pre-training on easily labeled data - reuse lower layers for "real" task


• Often required due to cost/availability of large labeled datasets
• Common tactic: label all training data as "good", generate & corrupt additional
  instances, label the new ones as "bad".


Faster Optimizers 


• Training speedup strategies thus far: 1) smart weight initializations, 2) smart
  activation functions, 3) batch normalization, 4) reuse of pretrained layers.
• Better optimizer choices:
  ◦ Momentum optimization
  ◦ Nesterov Accelerated Gradients
  ◦ AdaGrad
  ◦ RMSProp
  ◦ Adam (should almost always use this one)
• Worth noting: the techniques below rely on 1st-order partial derivatives
  (Jacobians); more techniques in the literature use 2nd-order derivs (Hessians).
  Not viable for most deep learning due to memory & computational
  requirements.


Momentum optimization 


• The local gradient (multiplied by the learning rate η) is added to a momentum
  vector (m); the weights are then updated by subtracting m.
• ie, gradient used as an acceleration - not as a speed.
• beta hyperparameter serves as a friction mechanism. 0 = high friction, 1 = no
  friction.
• Momentum optimization escapes plateaus much faster than GD.


optimizer = tf.train.MomentumOptimizer(
    learning_rate=learning_rate,
    momentum=0.9)
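The update rule can be sketched in plain Python as m ← β·m + η·∇J(θ), then θ ← θ − m. This is an illustrative sketch on a one-dimensional quadratic, J(θ) = θ²; the function name `momentum_step`, the cost function, and the hyperparameter values are assumptions made for the example.

```python
def momentum_step(theta, m, grad_fn, lr=0.1, beta=0.9):
    """One momentum update: the gradient changes the velocity m, and m moves theta."""
    m = beta * m + lr * grad_fn(theta)
    theta = theta - m
    return theta, m

def grad_fn(theta):
    # gradient of J(theta) = theta^2
    return 2.0 * theta

theta, m = 5.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad_fn)
print(theta)
```

With beta = 0 the update reduces to plain gradient descent; values closer to 1 let past gradients keep pushing, which is what carries the parameters across plateaus.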


Nesterov Accelerated Gradient 


• Idea: measure the cost function gradient slightly ahead, in the direction of the momentum.


optimizer = tf.train.MomentumOptimizer(
    learning_rate=learning_rate,
    momentum=0.9,
    use_nesterov=True)
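A minimal sketch of the look-ahead idea, under the assumption that weights are updated as θ ← θ − m: the gradient is evaluated at θ − β·m (where the momentum alone would carry the parameters) instead of at θ. The function name, the toy cost J(θ) = θ², and the hyperparameters are assumptions for the example, and TensorFlow's own implementation may be formulated differently.

```python
def nesterov_step(theta, m, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov update: gradient measured slightly ahead, in the direction of momentum."""
    lookahead = theta - beta * m
    m = beta * m + lr * grad_fn(lookahead)
    theta = theta - m
    return theta, m

def grad_fn(theta):
    # gradient of J(theta) = theta^2
    return 2.0 * theta

theta, m = 5.0, 0.0
for _ in range(100):
    theta, m = nesterov_step(theta, m, grad_fn)
print(theta)
```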


AdaGrad 


• Scales down the gradient vector along the steepest dimensions, ie it decays the learning
  rate faster for steep dimensions. (ie adaptive learning rate)
• Works on simple quadratic problems but often stops too early when training neural nets.
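The adaptive learning rate can be sketched as s ← s + g⊗g, then θ ← θ − η·g / sqrt(s + ε), where g is the gradient. The NumPy example below is an illustration on an elongated bowl; the function name, the cost function, and the hyperparameters are assumptions. It records the effective per-dimension learning rate so the decay is visible.

```python
import numpy as np

def adagrad_step(theta, s, grad, lr=0.5, eps=1e-10):
    """One AdaGrad update: accumulate squared gradients, scale each dimension's step."""
    s = s + grad * grad
    theta = theta - lr * grad / np.sqrt(s + eps)
    return theta, s

# J = 0.5 * (100 * t0^2 + t1^2): steep along t0, flatter along t1
curvature = np.array([100.0, 1.0])
theta, s = np.array([1.0, 1.0]), np.zeros(2)
effective_lr_history = []
for _ in range(50):
    grad = curvature * theta
    theta, s = adagrad_step(theta, s, grad)
    effective_lr_history.append(0.5 / np.sqrt(s + 1e-10))

print(effective_lr_history[0])    # steep dimension gets the smaller rate
print(effective_lr_history[-1])   # rates only ever shrink
```

Because s never decreases, the effective learning rate can only shrink, which is the reason the algorithm may stop before reaching the optimum.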


[Figure: cost surface over a steep dimension and a flatter dimension]


RMSProp


• Fixes the AdaGrad problem by accumulating only the most recent gradients (instead of
  all).
• Better than AdaGrad on all but very simple problems. Also better than Momentum
  optimization and Nesterov.


optimizer = tf.train.RMSPropOptimizer(
    learning_rate=learning_rate,
    momentum=0.9,
    decay=0.9,
    epsilon=1e-10)
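The difference from AdaGrad is the exponentially decaying average: s ← decay·s + (1 − decay)·g², then θ ← θ − η·g / sqrt(s + ε). The plain-Python sketch below feeds a constant gradient to show that s settles at g² instead of growing forever. The function name and values are assumptions made for the example, and the momentum term of the TF optimizer is left out.

```python
import math

def rmsprop_step(theta, s, grad, lr=0.01, decay=0.9, eps=1e-10):
    """One RMSProp update using a decaying average of squared gradients."""
    s = decay * s + (1.0 - decay) * grad * grad
    theta = theta - lr * grad / math.sqrt(s + eps)
    return theta, s

theta, s = 0.0, 0.0
for _ in range(200):
    theta, s = rmsprop_step(theta, s, grad=2.0)   # constant gradient of 2.0
print(s)       # approaches grad^2 = 4.0
print(theta)
```

Under AdaGrad the same constant gradient would give s = 4·n after n steps, so the step size would keep shrinking; here it stabilizes.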


Adam Optimization (paper)


• Keeps track of a decaying average of past gradients (like Momentum Optimization)
• Keeps track of a decaying average of past squared gradients (like RMSProp)
• Default params in TF:
  ◦ Momentum decay param (beta1) usually set to 0.9
  ◦ Scaling decay param (beta2) usually set to 0.999
  ◦ Smoothing term (epsilon) usually set to 1e-8


optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate)
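The two running averages combine as in the sketch below, written in plain Python for a single parameter. It follows the commonly published form of Adam, including the bias-correction step for the first iterations; the function name and the toy gradient are assumptions for the example, and TensorFlow's internal formulation may differ in detail.

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad           # decaying average of gradients
    s = beta2 * s + (1.0 - beta2) * grad * grad    # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction: m and s start at 0
    s_hat = s / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

# first step on J(theta) = theta^2 from theta = 1.0 (gradient = 2.0)
theta, m, s = adam_step(1.0, 0.0, 0.0, grad=2.0, t=1)
print(theta)
```

On the first step the bias-corrected ratio is close to grad / |grad|, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.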


Learning Rate Scheduling 


initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1/10

global_step = tf.Variable(
    0, trainable=False)

learning_rate = tf.train.exponential_decay(
    initial_learning_rate,
    global_step,
    decay_steps,
    decay_rate)

optimizer = tf.train.MomentumOptimizer(
    learning_rate,
    momentum=0.9)

training_op = optimizer.minimize(
    loss,
    global_step=global_step)
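With the settings above, the learning rate drops by a factor of 10 every 10,000 steps. The plain-Python sketch below reproduces the arithmetic of a (non-staircase) exponential schedule, lr = initial_lr × decay_rate^(global_step / decay_steps); the function name mirrors the TF call but is a standalone stand-in written for this example.

```python
def exponential_decay(initial_lr, global_step, decay_steps, decay_rate):
    """Learning rate after global_step training steps."""
    return initial_lr * decay_rate ** (global_step / decay_steps)

for step in (0, 5000, 10000, 20000):
    print(step, exponential_decay(0.1, step, 10000, 1 / 10))
```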


Regularization Techniques


Early Stopping


• Simply interrupt training when validation performance starts dropping.
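One common way to decide that validation performance has "started dropping" is to wait a few epochs for a new best before giving up. The sketch below assumes a hypothetical `patience` parameter and a made-up list of validation losses; neither comes from these notes.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch): stop once `patience` epochs pass without a new best."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch     # new best: remember it
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch                # no improvement for too long
    return best_epoch, len(val_losses) - 1

val_losses = [0.9, 0.7, 0.6, 0.55, 0.57, 0.58, 0.6, 0.65]   # made-up values
best_epoch, stop_epoch = early_stopping(val_losses)
print(best_epoch, stop_epoch)
```

In practice the model parameters would be saved at each new best epoch and restored after stopping.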


L1 & L2 Regularization


Dropout


• Popular technique - typically adds a 1-2% accuracy boost
• At every training step, every neuron has probability (p) of being temporarily
  ignored
• Typical p = 50%
• In TF: apply dropout() to the input layer & the output of every hidden layer.


from tensorflow.contrib.layers import dropout

[...]

is_training = tf.placeholder(
    tf.bool,
    shape=(),
    name='is_training')
keep_prob = 0.5

X_drop = dropout(
    X,
    keep_prob,
    is_training=is_training)

hidden1 = fully_connected(
    X_drop, n_hidden1, scope="hidden1")

hidden1_drop = dropout(
    hidden1, keep_prob, is_training=is_training)

hidden2 = fully_connected(
    hidden1_drop, n_hidden2, scope="hidden2")

hidden2_drop = dropout(
    hidden2, keep_prob, is_training=is_training)

logits = fully_connected(
    hidden2_drop, n_outputs,
    activation_fn=None,
    scope="outputs")
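What a dropout layer computes can be sketched in NumPy. This is an assumption-laden illustration of the common "inverted dropout" formulation (surviving activations are divided by keep_prob during training, and the layer is the identity at test time); the function signature is hypothetical and only loosely mirrors the TF call above.

```python
import numpy as np

def dropout(X, keep_prob, is_training, rng):
    """Zero each activation with probability 1 - keep_prob during training only."""
    if not is_training:
        return X                                  # no dropout when predicting
    mask = rng.random(X.shape) < keep_prob        # True where the unit is kept
    return X * mask / keep_prob                   # rescale so the expected value is unchanged

rng = np.random.default_rng(0)                    # hypothetical seed
X = np.ones((1000, 100))
train_out = dropout(X, keep_prob=0.5, is_training=True, rng=rng)
test_out = dropout(X, keep_prob=0.5, is_training=False, rng=rng)
print(np.unique(train_out), train_out.mean())
```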


Max-Norm Regularization 


• Each neuron's incoming weights w are constrained such that ||w||_2 <= r
• r = max-norm hyperparameter
• ||.||_2 = ℓ2 norm
• Reducing r increases regularization
• Not implemented in TF, but doable.
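The constraint is typically enforced by rescaling the weights after each training step. The NumPy sketch below assumes a weight matrix of shape (n_inputs, n_neurons), so that column j holds neuron j's incoming weights; the function name and sample values are made up for the example.

```python
import numpy as np

def max_norm_clip(W, r):
    """Rescale each column of W whose L2 norm exceeds r so that its norm equals r."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, r / np.maximum(norms, 1e-12))   # guard against zero norms
    return W * scale

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])            # column norms: 5.0 and 0.5
W_clipped = max_norm_clip(W, r=1.0)
print(W_clipped)
print(np.linalg.norm(W_clipped, axis=0))
```

Only the first neuron's weights are rescaled here; the second already satisfies the constraint and is left untouched.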


Data Augmentation 


• Generating new training instances from existing ones, with learnable
  differences
• ex: pics with shifts/rotates/resizes/flips/contrast changes
• TF has image manipulation ops built-in
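Two of the transformations mentioned above, a horizontal shift and a horizontal flip, can be sketched with NumPy array operations on a 2D grayscale image. The function name `augment` and the toy 3x3 "image" are assumptions for the example; whether a flip preserves the label depends on the data (it would not for most handwritten digits).

```python
import numpy as np

def augment(image, dx=1):
    """Return [shifted, flipped] variants of a 2D image; dx must be a positive pixel count."""
    shifted = np.roll(image, dx, axis=1)   # shift right by dx pixels (returns a copy)
    shifted[:, :dx] = 0                    # blank the columns that wrapped around
    flipped = image[:, ::-1]               # mirror left-right
    return [shifted, flipped]

image = np.arange(9.0).reshape(3, 3)
shifted, flipped = augment(image)
print(shifted)
print(flipped)
```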


Practical Guidelines 


• Suggested default DNN configuration:
  ◦ Initialization: He
  ◦ Activation: ELU
  ◦ Normalization: Batch
  ◦ Regularization: Dropout
  ◦ Optimizer: Adam
  ◦ Learning rate schedule: none


Intro 


Multi Devices, Single Machine 


• Check that GPU cards have NVIDIA Compute Capability >= 3.0
• Alternative using AWS: helpful blog post
• Google Cloud service: uses TPU hardware
• Which to use? (Tim Dettmers)
• Download CUDA & cuDNN, set their environment vars
• use the nvidia-smi command to check the installation
• install TF with GPU support
• open a Python shell, verify TF detects CUDA & cuDNN


import tensorflow as tf

sess = tf.Session()


import tensorflow as tf

config = tf.ConfigProto()
#config.gpu_options.per_process_gpu_memory_fraction = 0.4
config

Managing GPU RAM 


• TF grabs all GPU RAM on first graph invocation. To run a 2nd TF program
  while the 1st is still running, run each process on different GPU cards.
  (Below: program #1 sees GPUs 0,1; program #2 sees GPUs 2,3.)


$ CUDA_VISIBLE_DEVICES=0,1 python3 program_1.py
$ CUDA_VISIBLE_DEVICES=3,2 python3 program_2.py


• Option 2: tell TF to grab a % of memory. (Below: 40% allocation.)


session = tf.Session(config=config)
config, session


Placing Ops on Devices 


Parallel Execution 


• TF Whitepaper - dynamic algorithm, distributes ops across all available
  devices. But not available (yet) in open-source TF.


Simple Placement


• Mostly up to you. To pin ops to a specific device, use a device() function.
  Below: a,b pinned to cpu#0, c can go anywhere.


import tensorflow as tf

with tf.device("/cpu:0"):
    a, b = tf.Variable(3.0), tf.Variable(4.0)
c = a*b
c


Logging Placements


• Use log_device_placement=True. This tells the placer to log a msg whenever a
  node is "placed".


import tensorflow as tf

config = tf.ConfigProto()
config.log_device_placement = True
sess = tf.Session(config=config)
print(config, "\n", sess)


Dynamic Placement 


• You can specify a function instead of a device when creating a device block.


import tensorflow as tf

def variables_on_cpu(op):
    if op.type == "Variable":
        return "/cpu:0"
    else:
        return "/cpu:0"

with tf.device(variables_on_cpu):
    a = tf.Variable(3.0)
    b = tf.constant(4.0)
    c = a * b


Ops & Kernels 


• TF operations need a kernel defined in order to run on a device. Not all ops have
  kernels for both GPUs and CPUs. Example: TF doesn't have an integer kernel
  for GPUs. Changing i (below) from 3 to 3.0 should allow the op to run.


import tensorflow as tf
with tf.device("/gpu:0"):
    i = tf.Variable(3)
test = sess.run(i.initializer)
test


• To allow TF to "fall back" to a CPU instead, use allow_soft_placement=True.


with tf.device("/gpu:0"):
    i = tf.Variable(3)
config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
test = sess.run(i.initializer)  # the placer runs and falls back to /cpu:0

print(test)


Parallel Execution 


• TF executes any nodes with zero dependencies first. If those nodes are on
  separate devices, they are run in parallel. If on the same device, they are run
  in different threads & may be run in parallel.


Control Dependencies


• Use control dependencies to control/postpone node evaluations (ex: to avoid
  premature memory hogging).


import tensorflow as tf
a = tf.constant(1.0)
b = a + 2.0

# x and y are evaluated only after a and b have been evaluated
with tf.control_dependencies([a, b]):
    x = tf.constant(3.0)
    y = tf.constant(4.0)

print(x+y)


Multiple Devices - Multiple Servers 


• cluster: >=1 TF servers ("tasks") across machines. Tasks belong to jobs
  (collections of related tasks)
• "ps" = parameter server
• "worker" = computing engine


cluster_spec = tf.train.ClusterSpec({
    "ps": [
        "machine-a.example.com:2221",  # /job:ps/task:0
    ],
    "worker": [
        "machine-a.example.com:2222",  # /job:worker/task:0
        "machine-b.example.com:2222",  # /job:worker/task:1
    ]})

cluster_spec


server.join()
# blocks main thread until server stops (i.e., never)


Opening a Session 


# NOT YET WORKING
# open session
#a = tf.constant(1.0)
#b = a + 2
#c = a * 3
#with tf.Session("grpc://machine-b.example.com:2222") as sess:
#    print(c.eval())  # 9.0


Master & Worker Services 


• gRPC protocol is used to talk to servers. HTTP2 basis, bidirectional
• based on protocol buffers
• all servers can provide master & worker services.


Pinning Ops Across Tasks


• you can pin ops to any device
• ex:


# NOT WORKING YET
#with tf.device("/job:ps/task:0/cpu:0"):
#    a = tf.constant(1.0)
#with tf.device("/job:worker/task:0/cpu:0")
#with tf.device("/job:worker/task:0/gpu:1"):
#    b = a + 2
#c = a + b


Sharding Variables across Multiple Param Servers 


• sharding across servers mitigates the risk of network card saturation
• TF distributes variables across all "ps" tasks - round robin setup


# NOT WORKING YET
import tensorflow as tf
with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1 = tf.Variable(1.0)  # pinned to /job:ps/task:0
    v2 = tf.Variable(2.0)  # pinned to /job:ps/task:1
    v3 = tf.Variable(3.0)  # pinned to /job:ps/task:0
    v4 = tf.Variable(4.0)  # pinned to /job:ps/task:1
    v5 = tf.Variable(5.0)  # pinned to /job:ps/task:0


Sharing State across Sessions (Resource 
Containers) 


• local session: all vars are managed by the session itself & vanish when it ends.
• distributed session: vars are managed by resource containers on the cluster


# simple_client.py
#import tensorflow as tf
#import sys
#x = tf.Variable(0.0, name="x")
#increment_x = tf.assign(x, x + 1)
#with tf.Session(sys.argv[1]) as sess:
#    if sys.argv[2:]==["init"]:
#        sess.run(x.initializer)
#    sess.run(increment_x)
#    print(x.eval())


# launches client which connects to B, reuses variable x
# python3 simple_client.py grpc://machine-b.example.com:2222
#2.0


Async Communications (TF Queues)


• Queueing data
• Dequeueing data
• Queues of tuples
• Closing a queue
• RandomShuffleQueue
• PaddingFIFOQueue


Loading Data Directly from Graph 


• Needed to avoid file server (bandwidth) saturation
• Preloading data to variables
• Reading data from graph with reader operations
  ◦ CSV, binary, TFRecords
  ◦ TextLineReader reads file lines one-by-one
  ◦ record identifier (string): filename:linenumber
  ◦ tf.decode_csv(val, record_defaults=[...])


'''TO LOAD A GRAPH

instance_queue = tf.RandomShuffleQueue(
    capacity=10,
    min_after_dequeue=2,
    dtypes=[tf.float32, tf.int32],
    shapes=[[2],[]],
    name="instance_q",
    shared_name="shared_instance_q")


enqueue_instance = instance_queue.enqueue([features, target])
close_instance_queue = instance_queue.close()
'''


'''TO RUN THE GRAPH
with tf.Session([...]) as sess:
    sess.run(enqueue_filename, feed_dict={filename: "my_test.csv"})
    sess.run(close_filename_queue)
    try:
        while True:
            sess.run(enqueue_instance)
    except tf.errors.OutOfRangeError as ex:
        pass # no more records in the current file and no more files to read
    sess.run(close_instance_queue)
'''
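
To make the `capacity` and `min_after_dequeue` arguments of the RandomShuffleQueue above more concrete, here is a simplified pure-Python stand-in. It is a sketch under stated assumptions: the class name `ShuffleBuffer` is hypothetical, and it raises an exception where a real TF queue would block the calling thread.

```python
import random


class ShuffleBuffer:
    """Simplified, non-blocking stand-in for a random shuffle queue."""

    def __init__(self, capacity, min_after_dequeue, seed=None):
        self.capacity = capacity
        self.min_after_dequeue = min_after_dequeue
        self.items = []
        self.closed = False
        self.rng = random.Random(seed)

    def enqueue(self, item):
        if self.closed:
            raise RuntimeError("queue is closed")
        if len(self.items) >= self.capacity:
            raise RuntimeError("queue is full")  # a real queue would block here
        self.items.append(item)

    def close(self):
        self.closed = True

    def dequeue(self):
        # While the queue is open, at least min_after_dequeue items must remain
        # after the dequeue so that there is enough left to shuffle.
        floor = 0 if self.closed else self.min_after_dequeue
        if len(self.items) <= floor:
            raise IndexError("not enough items")  # a real queue would block or signal end of data
        index = self.rng.randrange(len(self.items))
        return self.items.pop(index)


queue = ShuffleBuffer(capacity=10, min_after_dequeue=2, seed=42)
for value in range(5):
    queue.enqueue(value)

first = [queue.dequeue() for _ in range(3)]  # 5 items, 2 must remain: 3 dequeues allowed
queue.close()
rest = [queue.dequeue() for _ in range(2)]   # once closed, the remainder can be drained
print(first, rest)
```

A larger `min_after_dequeue` gives better mixing at the cost of a longer warm-up before the first item can be read.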


Multithreaded readers using a Coordinator & QueueRunner 


Other convenience functions 


• string_input_producer()
• tf.train.start_queue_runners()


producer functions - create queues


• input_producer()
• range_input_producer()
• slice_input_producer()
• shuffle_batch(list of tensors)
  ◦ returns RandomShuffleQueue
  ◦ returns QueueRunner (added to GraphKeys.QUEUE_RUNNERS)
  ◦ dequeue_many() = returns minibatch from queue
  ◦ batch() --?
  ◦ batch_join() --?
  ◦ shuffle_batch_join() --?


One NN per Device 


• near-linear speedup: training 100 nets across 50 servers x 2 GPUs/server takes roughly as long as training 1 net on 1 GPU (perfect for hyperparameter tuning)
• potential option: TF Serving, released 2/2016


In-Graph vs Between-Graph Replication (for 
Ensembles) 


• Two approaches to building ensembles:
1) one big graph, one session, any server in cluster ("in-graph replication")
2) one graph per network; handle synchronization yourself ("between-graph replication") using queues -- considered more flexible


# RunOptions ... timeout_in_ms

'''NOT YET

with tf.Session([...]) as sess:
    [...]
    run_options = tf.RunOptions()
    run_options.timeout_in_ms = 1000 # 1s timeout
    try:
        pred = sess.run(dequeue_prediction, options=run_options)
    except tf.errors.DeadlineExceededError as ex:
        [...] # the dequeue operation timed out after 1s
'''

'''NOT YET

config = tf.ConfigProto()
config.operation_timeout_in_ms = 1000
# 1s timeout for every operation
with tf.Session([...], config=config) as sess:
    [...]
    try:
        pred = sess.run(dequeue_prediction)
    except tf.errors.DeadlineExceededError as ex:
        [...] # the dequeue operation timed out after 1s
'''


Model Parallelism 


• Chopping models, running chunks on different devices
  ◦ Fully connected nets (FCNs): not much value in doing this; vertical & horizontal slicing don't work well either
  ◦ Nets w/ partially connected layers (CNNs): easier to distribute
  ◦ Some RNNs use memory cells (a cell's output at time t is fed back as its input at t+1)


Data Parallelism 


• Sync updates (aggregator waits for all gradients to be available, finds avg, applies result) - could be delayed by slow devices; params could also saturate server bandwidth
• Async updates - more training steps/minute. Issue: "stale gradients" (gradients computed from parameter values that other replicas have since updated) - slows convergence, introduces noise/wobble. To mitigate this:
  ◦ reduce learning rate
  ◦ drop/scale back stale gradients
  ◦ adjust minibatch size
  ◦ start first few epochs with just one replica ("warmup phase")
• Bandwidth - at some point, adding GPUs doesn't help because the network saturates and can't carry more data traffic (Google report). Steps you can take:
  ◦ group GPUs on a single server (avoids network hops)
  ◦ shard params across servers
  ◦ drop precision from float32 to bfloat16
  ◦ 8b precision ("quantization"): see mobile phone apps
• How TF does it -
  ◦ you choose 1) replication type (in-graph, between-graph) and 2) update type (async or sync):
    1) in-graph + sync: one big graph
    2) in-graph + async: 1 optimizer/replica, 1 thread/replica
    3) between-graph + sync: wrap optimizer in SyncReplicasOptimizer
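
A synchronous update, as described in the list above, boils down to averaging the replicas' gradients and applying one step. The sketch below shows just that arithmetic in plain Python; the function name and the gradient values are made up for illustration, and plain gradient descent stands in for whatever optimizer is actually used.

```python
def synchronous_update(params, replica_gradients, learning_rate):
    """Average the gradients from all replicas, then take one gradient-descent step."""
    n_replicas = len(replica_gradients)
    averaged = [sum(grads) / n_replicas for grads in zip(*replica_gradients)]
    return [p - learning_rate * g for p, g in zip(params, averaged)]


params = [1.0, 2.0]
replica_gradients = [[0.5, 1.0], [1.5, 3.0]]  # made-up gradients from two replicas
new_params = synchronous_update(params, replica_gradients, learning_rate=0.5)
print(new_params)
```

The aggregator cannot compute `averaged` until every replica has reported, which is why a single slow device delays the whole step.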


Intro - Visual Cortex 


• LeNet-5 paper - introduced convolutional & pooling layers


# utilities
import numpy as np
import matplotlib.pyplot as plt


def plot_image(image):
    plt.imshow(image, cmap="gray", interpolation="nearest")
    plt.axis("off")


def plot_color_image(image):
    plt.imshow(image.astype(np.uint8), interpolation="nearest")
    plt.axis("off")


Convolutional Layers 


• math detail
• each neuron is connected only to the neurons within its receptive field in the previous layer; zero padding is used so that layers keep the same height & width.
• can also connect a large input layer to a much smaller layer by spacing out receptive fields (distance between receptive fields = stride)


[Figures: layers, padding, strides]


Filters


• neuron weights can be visualized as a small image (w/ size = receptive field)
• examples given: 1) vertical filter (single vertical bar, mid-image, all other cells zero) 2) horizontal filter (single horizontal bar, mid-image, all other cells zero)
• both return feature maps (which highlight the areas of the image most similar to the filter)


import numpy as np


fmap = np.zeros(shape=(7, 7, 1, 2), dtype=np.float32)
fmap[:, 3, 0, 0] = 1  # vertical line filter
fmap[3, :, 0, 1] = 1  # horizontal line filter

print(fmap[:, :, 0, 0])
print(fmap[:, :, 0, 1])


plt.figure(figsize=(6,6))
plt.subplot(121)
plot_image(fmap[:, :, 0, 0])
plt.subplot(122)
plot_image(fmap[:, :, 0, 1])
plt.show()


from sklearn.datasets import load_sample_image


china = load_sample_image("china.jpg")
flower = load_sample_image("flower.jpg")


image = china[150:220, 130:250]
height, width, channels = image.shape


image_grayscale = image.mean(axis=2).astype(np.float32)
images = image_grayscale.reshape(1, height, width, 1)


import tensorflow as tf


tf.reset_default_graph()


# Define the model


X = tf.placeholder(
    tf.float32,
    shape=(None, height, width, 1))


feature_maps = tf.constant(fmap)


convolution = tf.nn.conv2d(
    X,
    feature_maps,
    strides=[1,1,1,1],
    padding="SAME",
    use_cudnn_on_gpu=False)


# Run the model


with tf.Session() as sess:
    output = convolution.eval(feed_dict={X: images})


plt.figure(figsize=(6,6))


#plt.subplot(121)
plot_image(images[0, :, :, 0])
#plt.subplot(122)
plot_image(output[0, :, :, 0])
#plt.subplot(123)
plot_image(output[0, :, :, 1])
plt.show()


Stacking Feature Maps


• images are made of sublayers (one per color channel: typically red/green/blue; grayscale = one channel, others = many channels)


import numpy as np
from sklearn.datasets import load_sample_images


# Load sample images
dataset = np.array(load_sample_images().images, dtype=np.float32)
batch_size, height, width, channels = dataset.shape


# Create 2 filters
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1 # vertical line
filters[3, :, :, 1] = 1 # horizontal line


# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32,
                   shape=(None, height, width, channels))


convolution = tf.nn.conv2d(
    X, filters, strides=[1,2,2,1], padding="SAME")


with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})


plt.imshow(output[0, :, :, 1])
plt.show()


"Valid" v. "Same" Padding
[Figure: padding="VALID" (i.e., without padding; inputs at the edge may be ignored) vs. padding="SAME" (i.e., with zero padding)]


import tensorflow as tf
import numpy as np


tf.reset_default_graph()


filter_primes = np.array(
    [2., 3., 5., 7., 11., 13.],
    dtype=np.float32)


x = tf.constant(
    np.arange(1, 13+1, dtype=np.float32).reshape([1, 1, 13, 1]))


print("x:\n", x)


filters = tf.constant(
    filter_primes.reshape(1, 6, 1, 1))


# conv2d arguments:
# x = input minibatch = 4D tensor
# filters = 4D tensor
# strides = 1D array (1, vstride, hstride, 1)
# padding = VALID - no zero padding, may ignore edge rows/cols
# padding = SAME - zero padding used if needed


valid_conv = tf.nn.conv2d(x, filters, strides=[1, 1, 5, 1], padding='VALID')
same_conv = tf.nn.conv2d(x, filters, strides=[1, 1, 5, 1], padding='SAME')


with tf.Session() as sess:
    print("VALID:\n", valid_conv.eval())
    print("SAME:\n", same_conv.eval())


x:
 Tensor("Const:0", shape=(1, 1, 13, 1), dtype=float32)
VALID:
 [[[[ 184.]
   [ 389.]]]]
SAME:
 [[[[ 143.]
   [ 348.]
   [ 204.]]]]
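
The VALID and SAME results above can be checked by hand. The following pure-Python sketch implements a strided 1-D convolution (really a cross-correlation, as in TF) with both padding modes. It assumes the usual TensorFlow convention that SAME padding yields ceil(n / stride) outputs and puts any odd extra zero on the right; the function name `conv1d` is just an illustrative helper, not the TF op.

```python
import math


def conv1d(inputs, kernel, stride, padding):
    """Strided 1-D cross-correlation with "VALID" or "SAME" padding."""
    n, k = len(inputs), len(kernel)
    if padding == "VALID":
        # No padding: windows that would run past the end are dropped.
        n_out = math.ceil((n - k + 1) / stride)
        padded = list(inputs)
    elif padding == "SAME":
        # Zero padding: output length depends only on input length and stride.
        n_out = math.ceil(n / stride)
        total_pad = max((n_out - 1) * stride + k - n, 0)
        left = total_pad // 2  # any extra zero goes on the right
        padded = [0.0] * left + list(inputs) + [0.0] * (total_pad - left)
    else:
        raise ValueError(padding)
    return [sum(w * padded[i * stride + j] for j, w in enumerate(kernel))
            for i in range(n_out)]


inputs = list(range(1, 14))        # 1..13, as in the example above
kernel = [2, 3, 5, 7, 11, 13]      # the six primes used as filter weights

valid_out = conv1d(inputs, kernel, stride=5, padding="VALID")
same_out = conv1d(inputs, kernel, stride=5, padding="SAME")
print("VALID:", valid_out)
print("SAME:", same_out)
```

With VALID padding the last two inputs (12 and 13) are never seen by the filter; with SAME padding one zero is added on the left and two on the right so that every input is covered.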


Pooling Layers 


• Goal: subsample (shrink) the input image to reduce the computational load.
• Need to define pool size, stride & padding type.
• Result: each output is an aggregate (max, mean) of the inputs in its pooling window
• Below: max pool, 2x2, stride = 2, no padding.


dataset = np.array([china, flower], dtype=np.float32)
batch_size, height, width, channels = dataset.shape


filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1 # vertical line
filters[3, :, :, 1] = 1 # horizontal line


X = tf.placeholder(tf.float32,
                   shape=(None, height, width, channels))


# alternative: avg_pool()
max_pool = tf.nn.max_pool(
    X,
    ksize=[1, 2, 2, 1],
    strides=[1,2,2,1],
    padding="VALID")


with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})


plt.figure(figsize=(12,12))
plt.subplot(121)
plot_color_image(dataset[0])
plt.subplot(122)
plot_color_image(output[0])
plt.show()
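
To see what the 2x2, stride-2 max pool computes on individual values, here is a small pure-Python sketch on a single-channel 5x5 grid. The helper `max_pool_2d` and the sample values are made up for illustration; it mirrors VALID padding by dropping rows and columns that do not fill a whole window.

```python
def max_pool_2d(image, pool_size=2, stride=2):
    """Max pooling without padding ("VALID") over a 2-D list of numbers."""
    height, width = len(image), len(image[0])
    out_rows = (height - pool_size) // stride + 1
    out_cols = (width - pool_size) // stride + 1
    return [[max(image[r * stride + i][c * stride + j]
                 for i in range(pool_size) for j in range(pool_size))
             for c in range(out_cols)]
            for r in range(out_rows)]


image = [[1, 3, 2, 1, 3],
         [2, 9, 1, 1, 5],
         [1, 3, 2, 3, 2],
         [8, 3, 5, 1, 0],
         [5, 6, 1, 2, 9]]

pooled = max_pool_2d(image)
print(pooled)
```

The 5x5 input shrinks to 2x2, and the last row and column are ignored because no full 2x2 window fits there.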





Memory Requirements


• Main memory killer: reverse pass of backprop - needs all intermediate values computed during forward pass
• Example CNN:
  ◦ 5x5 filters outputting 200 feature maps (size 150x100)
  ◦ stride = 1, "SAME" padding
  ◦ If image = 150x100x3 (RGB), then
  ◦ params count = (5x5x3+1) * 200 = 15,200
  ◦ each of the 200 feature maps contains 150 x 100 neurons, and each neuron computes a weighted sum of 5x5x3 = 75 inputs => 200 x 150 x 100 x 75 = 225M floating-point multiplies.
  ◦ If using 32b floats => output requires 200x150x100x32 = 96M bits = about 11.4MB for one instance


During inference: one layer's memory can be released as soon as the next layer has been computed. (You only need enough memory for two consecutive layers.)


During training: all computed values have to be preserved for the reverse pass. (You need enough memory for all layers.)
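
The figures in the example CNN above are easy to re-derive. The snippet below simply redoes the arithmetic; the variable names are my own, and the megabyte figure uses 1 MB = 2^20 bytes, which is the convention that reproduces the 11.4MB quoted above.

```python
kernel_h, kernel_w, in_channels = 5, 5, 3   # 5x5 filters over an RGB image
n_maps, map_h, map_w = 200, 150, 100        # 200 feature maps of 150x100 neurons

# One weight per input in the receptive field plus one bias, per feature map.
n_params = (kernel_h * kernel_w * in_channels + 1) * n_maps

# Every neuron multiplies each of its 75 inputs by a weight.
n_multiplies = n_maps * map_h * map_w * kernel_h * kernel_w * in_channels

# One 32-bit float per neuron output.
output_bits = n_maps * map_h * map_w * 32
output_mb = output_bits / 8 / 2**20

print(n_params, n_multiplies, output_bits, round(output_mb, 1))
```

Multiplying the per-instance output size by the batch size, and summing over all layers, gives a rough lower bound on training memory.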


CNN Architectures 


LeNet-5 (c. 1998, used to solve MNIST digits dataset)


Layer  Type             Maps  Size   Kernel size  Stride  Activation
Out    Fully Connected  -     10     -            -       RBF
F6     Fully Connected  -     84     -            -       tanh
C5     Convolution      120   1x1    5x5          1       tanh
S4     Avg Pooling      16    5x5    2x2          2       tanh
C3     Convolution      16    10x10  5x5          1       tanh
S2     Avg Pooling      6     14x14  2x2          2       tanh
C1     Convolution      6     28x28  5x5          1       tanh
In     Input            1     32x32  -            -       -


• MNIST images zero-padded to 32x32 & normalized
• pooling layers: mean x learnable coefficient + learnable bias
• output layer: output = square of the Euclidean distance (input vector, weight vector)


AlexNet won ILSVRC 2012


Layer  Type             Maps     Size     Kernel size  Stride  Padding  Activation
Out    Fully Connected  -        1,000    -            -       -        Softmax
F9     Fully Connected  -        4,096    -            -       -        ReLU
F8     Fully Connected  -        4,096    -            -       -        ReLU
C7     Convolution      256      13x13    3x3          1       SAME     ReLU
C6     Convolution      384      13x13    3x3          1       SAME     ReLU
C5     Convolution      384      13x13    3x3          1       SAME     ReLU
S4     Max Pooling      256      13x13    3x3          2       VALID    -
C3     Convolution      256      27x27    5x5          1       SAME     ReLU
S2     Max Pooling      96       27x27    3x3          2       VALID    -
C1     Convolution      96       55x55    11x11        4       SAME     ReLU
In     Input            3 (RGB)  224x224  -            -       -        -


• Uses 50% dropout on layers F8, F9 for regularization
• Uses random image shifts/flips/rotations/lighting changes to augment dataset
• Uses local response normalization on layers C1, C3.
• Hyperparameter settings: r=2, alpha=0.00002, beta=0.75, k=1
• ZFNet (tweaked AlexNet) won ILSVRC 2013.
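
For the local response normalization hyperparameters listed above (r, alpha, beta, k), the sketch below shows how they are commonly combined: each activation is divided by (k + alpha * sum of squared activations in the r/2 neighboring feature maps on each side) raised to the power beta. This follows the usual formulation of AlexNet-style LRN as I understand it; the function name is hypothetical and the code handles a single spatial position only.

```python
def local_response_normalization(a, r=2, alpha=0.00002, beta=0.75, k=1.0):
    """Normalize activations `a` (one value per feature map, same spatial position)."""
    n_maps = len(a)
    half = r // 2
    normalized = []
    for i in range(n_maps):
        j_low = max(0, i - half)            # first neighboring feature map
        j_high = min(i + half, n_maps - 1)  # last neighboring feature map
        scale = k + alpha * sum(a[j] ** 2 for j in range(j_low, j_high + 1))
        normalized.append(a[i] * scale ** -beta)
    return normalized


# Exaggerated hyperparameters so the effect is visible on tiny numbers.
demo = local_response_normalization([1.0, 1.0, 1.0], r=2, alpha=1.0, beta=1.0, k=1.0)
print(demo)
```

A strongly activated neuron inflates the denominator of its neighbors in adjacent feature maps, which pushes the maps to specialize.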


GoogLeNet won ILSVRC 2014 


• Much deeper than previous nets
• Uses inception modules to use params much more efficiently. They use 1x1 kernels as "bottleneck layers" (which reduce dimensionality). Also: pairs of convolutional layers act as a single, more powerful convolutional layer.
• All convolutional layers use ReLU activation.

[Figure: inception module]

[Figure: GoogLeNet architecture, from the input layer up to the 1000-unit output layer, with inception modules marked by their own symbol]


ResNet 


• 152 layers deep
• Uses skip connections to connect non-adjacent layers in stack
• skip connections force the network to model f(x) = h(x) - x (residual learning), where h(x) is the target function. When initialized, weights are near zero => the network outputs a near-copy of its inputs


[Figure: residual learning: a regular unit models h(x) directly; a residual unit adds its input x (identity skip connection) to its output, so its layers only need to model f(x) = h(x) - x]


• architecture: stack starts & ends like GoogLeNet, with a stack of residual units in between.


[Figure: ResNet architecture]
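
The claim that a freshly initialized residual unit nearly copies its input can be checked with a toy example. In this sketch a single linear layer stands in for the unit's convolutional layers (an intentional simplification), and `residual_unit` is a made-up helper name.

```python
def residual_unit(x, weights):
    """Return f(x) + x, where f is a single linear layer (weights times x)."""
    f_x = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [f + v for f, v in zip(f_x, x)]


x = [1.0, -2.0, 3.0]

# With all-zero weights, f(x) = 0, so the skip connection passes x through unchanged.
zero_weights = [[0.0] * 3 for _ in range(3)]
print(residual_unit(x, zero_weights))

# With non-zero weights, the layer only has to learn the difference h(x) - x.
small_weights = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
print(residual_unit(x, small_weights))
```

Because the identity mapping comes for free, the signal can cross many layers right from the start of training.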


TF Convolution Ops 


• conv1d() - 1D layer - good for NLP
• conv3d() - 3D layer - good for PET scans
• atrous_conv2d() - 2D layer with "holes"
• conv2d_transpose() - 2D "deconvolutional layer" - upsamples image by inserting zeroes between inputs
• depthwise_conv2d() - applies every filter to each input channel independently
• separable_conv2d() - depthwise convolution, then applies a 1x1 convolutional layer to the result
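
The "holes" in an atrous convolution can be pictured as zeros inserted between the filter weights. The sketch below dilates a 1-D filter this way; `dilate_filter` is an illustrative helper of my own, not a TensorFlow function.

```python
def dilate_filter(kernel, rate):
    """Insert (rate - 1) zeros between consecutive filter weights."""
    dilated = []
    for index, weight in enumerate(kernel):
        if index > 0:
            dilated.extend([0] * (rate - 1))
        dilated.append(weight)
    return dilated


print(dilate_filter([1, 2, 3], rate=2))
print(dilate_filter([1, 2, 3], rate=4))
```

Applying the dilated filter with an ordinary convolution gives a larger receptive field with no extra parameters.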


%%html 

<style> 

img[alt=recurrent_unrolled] { width: 400px; }
</style> 

<style> 

img[alt=sequence_vector] { width: 400px; } 
</style> 

<style> 

img[alt=gru-cell] { width: 400px; }
</style> 

<style> 

img[alt=encoder-decoder] { width: 400px; } 
</style> 


Intro 


• Use case: analysis of arbitrary-length sequence data; RNNs can anticipate what comes next
• RNNs are much like feed-forward NNs, but also have backward-facing connections
• At time step t, each node sees input x(t) plus its previous output y(t-1).
• Below: "unrolling" a net across a time axis.


[Figure: a recurrent layer unrolled through time, showing the outputs y at successive time steps; x-axis: Time]
Memory Cells 


• A network node that preserves state across time is called a cell (memory cell).
• h(t) is a cell's "hidden" state at time=t.


[Figure: a memory cell unrolled through time; x-axis: Time]


Input/Output Sequences 


• RNNs can be used to predict a time-shifted version of the input sequence (sequence-to-sequence), a sentiment score (sequence-to-vector), or an image caption (vector-to-sequence).
• sequence-to-vector nets = encoders; vector-to-sequence nets = decoders. One use case: language translation.
• Below:
  ◦ Top Left: Sequence-to-sequence
  ◦ Top Right: Sequence-to-vector
  ◦ Bot Left: Vector-to-sequence
  ◦ Bot Right: Delayed-sequence-to-sequence


Basic RNNs in TF


• RNN design: layer of 5 recurrent neurons with tanh activation; runs over 2 time steps, taking an input vector of size 3 at each step.


import tensorflow as tf


n_inputs = 3
n_neurons = 5


# one recurrent layer, unrolled over two time steps


X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])


Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons], dtype=tf.float32))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons], dtype=tf.float32))


b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))


Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)


init = tf.global_variables_initializer()


# to feed inputs at both time steps,


import numpy as np
# Mini-batch: instance 0,instance 1,instance 2,instance 3


X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t = 1


with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})


print("output at t=0:\n",Y0_val,"\n","output at t=1\n",Y1_val)


output at t=0: 
[[-0.77183092 -0.99924457 0.23752896 -0.63130957 -0.83723265] 
[-0.92028087 -1. 0.99004787 -0.87230623 -0.99995315] 
[-0.97358704 -1. 0.999919 -0.95966864 -1. ] 
[ 0.99999094 -0.99890459 0.9991411 0.99996841 -0.99999803]] 
output at t=1 
[[ 0.99512661 -1. 0.99997395 -0.99830353 -1. ]
[ 0.99977976 0.99013239 -0.96352106 -0.99476629 0.97579277] 
[ 0.99981618 -0.99989575 0.99114233 -0.99827981 -0.99984008] 
[ 0.54805535 -0.84061396 -0.99912792 -0.47432473 -0.99921536]] 
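
The two-step recurrence above, Y0 = tanh(X0 Wx + b) and Y1 = tanh(Y0 Wy + X1 Wx + b), can also be traced without TensorFlow. The sketch below uses tiny hand-picked weights (not the random ones from the run above) so the numbers are easy to follow; `matmul` and `rnn_step` are illustrative helper names.

```python
import math


def matmul(X, W):
    """Multiply a batch X (list of row vectors) by a matrix W (list of rows)."""
    return [[sum(x[k] * W[k][j] for k in range(len(W))) for j in range(len(W[0]))]
            for x in X]


def rnn_step(X_t, Y_prev, Wx, Wy, b):
    """Y_t = tanh(X_t Wx + Y_prev Wy + b), computed for a whole mini-batch."""
    xw = matmul(X_t, Wx)
    yw = matmul(Y_prev, Wy)
    return [[math.tanh(xw[i][j] + yw[i][j] + b[j]) for j in range(len(b))]
            for i in range(len(X_t))]


# 2 inputs, 2 neurons, hand-picked weights (assumptions for illustration).
Wx = [[1.0, 0.0], [0.0, 1.0]]
Wy = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]

X0 = [[1.0, 0.0]]          # mini-batch of one instance at t = 0
X1 = [[0.0, 0.0]]          # same instance at t = 1
Y_init = [[0.0, 0.0]]      # no previous output before the first step

Y0 = rnn_step(X0, Y_init, Wx, Wy, b)
Y1 = rnn_step(X1, Y0, Wx, Wy, b)
print(Y0, Y1)
```

Even though the input at t = 1 is all zeros, Y1 is non-zero because the previous output is fed back through Wy.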


Unrolling through Time (Static) using static_rnn()


tf.reset_default_graph()


n_inputs = 3
n_neurons = 5


X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])


# BasicRNNCell() -- memcell "factory"


basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)


# static_rnn() -- creates unrolled RNN net by chaining cells.
# returns 1) python list of output tensors for each time step
#         2) tensor of final network states


output_seqs, states = tf.contrib.rnn.static_rnn(
    basic_cell,
    [X0, X1],
    dtype=tf.float32)


Y0, Y1 = output_seqs


init = tf.global_variables_initializer()


# to feed inputs at both time steps,


import numpy as np
# Mini-batch: instance 0,instance 1,instance 2,instance 3


X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t = 1


# Y0, Y1 - network outputs at both time steps


with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})


print("output at t=0:\n",Y0_val,"\n","output at t=1\n",Y1_val)


output at t=0:
[[ 0.42442048 0.92431569 -0.2353479 -0.90074939 -0.94408685] 
[ 0.73783255 0.98977458 -0.72123086 -0.99919385 -0.99999249] 
[ 0.89336294 0.99865782 -0.9186905 -0.99999398 -1. ] 
[-0.99143326 -0.99993676 -0.37607926 0.88796568 -0.99899191]] 
output at t=1 
[[ 0.81709599 0.48319042 -0.96708876 -0.9998284 -1. ]
[-0.18962485 -0.81231028 -0.21763545 0.88739753 0.57306314]
[ 0.17130674 -0.6411857 -0.86380148 -0.95413983 -0.99999553]
[-0.07749119 -0.86547101 -0.00461033 -0.91877526 -0.99582738]]


Simplification 


tf.reset_default_graph()


n_steps = 2
n_inputs = 3
n_neurons = 5


# this time, use placeholder with add'l dimension for #timesteps
#X0 = tf.placeholder(tf.float32, [None, n_inputs])
#X1 = tf.placeholder(tf.float32, [None, n_inputs])
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])


#print(X)


# transpose - make time steps the 1st dimension
# unstack - extract list of tensors (one per time step)
X_seqs = tf.unstack(
    tf.transpose(
        X, perm=[1, 0, 2]))


#print(X_seqs)


# BasicRNNCell() -- memcell "factory"


basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)


# static_rnn() -- creates unrolled RNN net by chaining cells.
# returns 1) python list of output tensors for each time step
#         2) tensor of final network states


output_seqs, states = tf.contrib.rnn.static_rnn(
    basic_cell,
    X_seqs,
    dtype=tf.float32)


#Y0, Y1 = output_seqs
# stack - merge output tensors
# transpose - swap 1st two dimensions
# returns tensor shape [None, #steps, #neurons]
outputs = tf.transpose(
    tf.stack(output_seqs),
    perm=[1,0,2])


init = tf.global_variables_initializer()


X_batch = np.array([
    # t = 0      t = 1
    [[0, 1, 2], [9, 8, 7]], # instance 0
    [[3, 4, 5], [0, 0, 0]], # instance 1
    [[6, 7, 8], [6, 5, 4]], # instance 2
    [[9, 0, 1], [3, 2, 1]], # instance 3
])


with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})


print(outputs_val)


[Output: outputs_val, an array of shape (4, 2, 5): 4 instances x 2 time steps x 5 neurons]


• Above code still not ideal - builds a graph with one cell per time step. Ugly & can cause out-of-memory errors.


Unrolling through Time using dynamic_rnn()


• uses a while_loop() to iterate over the memcell
• set swap_memory=True to move GPU memory to CPU during backprop if needed
• accepts single tensor, outputs single tensor - no stack/unstack/transpose ops required.


tf.reset_default_graph()
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])


basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)


outputs, states = tf.nn.dynamic_rnn(
    basic_cell, X, dtype=tf.float32)


init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})
print(outputs_val)
[[[ 0.01341763 -0.10483158 -0.94257653 0.83843452 -0.20272173] 


[ 0.99978089 -0.63150525 -0.99999148 0.99999386 -0.87993085] ] 


[[ 0.94205797 -0.13386673 -0.9997741 0.99812031 -0.64444101] 
[-0.6134249 -0.55738503 0.39783546 0.89031053 0.04465704]] 


[[ 0.99817288 -0.16267382 -0.99999928 0.99997997 -0.86824256] 
[ 0.99097538 -0.61533296 -0.99695957 0.99986053 -0.64558744]] 


[[ 0.9963541 0.23641461 0.75174934 0.98267573 -0.97034496] 
[ 0.85169196 -0.07830215 -0.3604137 0.95550352 0.12307668]] 


Variable-Length Input Sequences 


• Most problems will have variable-length inputs (like sentences).
• For these, pass dynamic_rnn() a sequence_length param (1D tensor giving each instance's input length)


tf.reset_default_graph()


X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])


seq_length = tf.placeholder(tf.int32, [None])
basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)


outputs, states = tf.nn.dynamic_rnn(
    basic_cell, X, dtype=tf.float32,
    sequence_length=seq_length)


X_batch = np.array([
    # t = 0      t = 1
    [[0, 1, 2], [9, 8, 7]], # instance 1
    [[3, 4, 5], [0, 0, 0]], # instance 2 -- zero padded
    [[6, 7, 8], [6, 5, 4]], # instance 3
    [[9, 0, 1], [3, 2, 1]], # instance 4
])


seq_length_batch = np.array([2,1,2,2])


init = tf.global_variables_initializer()


with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states],
        feed_dict={X: X_batch, seq_length: seq_length_batch})


# RNN should output zero vectors for any time step
# beyond input sequence length
print(outputs_val)


[Output: outputs_val, shape (4, 2, 5); the second time step of instance 2 (sequence length 1) is a zero vector]


print(states_val)


[Output: states_val, shape (4, 5); the final state of each instance, i.e. its output at its last valid time step]


Variable-Length Output Sequences


• Output sequences typically do not have the same length as the input sequences
• Most common solution: use an end-of-sequence (EOS) token.
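
As a minimal illustration of the EOS idea: the decoder always emits a fixed maximum number of steps, and everything from the first EOS token onward is discarded. The token string "<eos>", the padding token and the helper name below are assumptions made for this sketch.

```python
def truncate_at_eos(tokens, eos="<eos>"):
    """Keep the tokens generated before the first end-of-sequence marker."""
    output = []
    for token in tokens:
        if token == eos:
            break
        output.append(token)
    return output


generated = ["je", "bois", "du", "lait", "<eos>", "<pad>", "<pad>"]
print(truncate_at_eos(generated))
```

If the model never emits EOS, the full fixed-length sequence is returned unchanged.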


RNN Training 


• Unroll through time (as shown above), then use backpropagation through time (BPTT).


[Figure: backpropagation through time]


RNN Training: Classifier


• Example: use MNIST (a CNN would be better, but let's keep it simple)
• Treat each image as a sequence of 28 rows of 28 pixels each
• Use 150 recurrent neurons + a fully connected layer of 10 neurons (1 per class)
• Followed by softmax layer








# similar to MNIST classifier
# unrolled RNN replaces hidden layers


tf.reset_default_graph()

from tensorflow.contrib.layers import fully_connected
n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10


learning_rate = 0.001


X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])


basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)


outputs, states = tf.nn.dynamic_rnn(
    basic_cell, X, dtype=tf.float32)


logits = fully_connected(
    states, n_outputs, activation_fn=None)


xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=y, logits=logits)


loss = tf.reduce_mean(
    xentropy)


optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate)


training_op = optimizer.minimize(
    loss)


correct = tf.nn.in_top_k(
    logits, y, 1)


accuracy = tf.reduce_mean(
    tf.cast(correct, tf.float32))


init = tf.global_variables_initializer()


# load MNIST data, reshape to [batch_size, n_steps, n_inputs]
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels


Extracting /tmp/data/train-images-idx3-ubyte.gz 
Extracting /tmp/data/train-labels-idx1-ubyte.gz 
Extracting /tmp/data/t10k-images-idx3-ubyte.gz 
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz 


# ready to run. reshape each training batch before feeding to net


n_epochs = 10
batch_size = 150


with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape(
                (-1, n_steps, n_inputs))
            sess.run(
                training_op,
                feed_dict={X: X_batch, y: y_batch})

        acc_train = accuracy.eval(
            feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(
            feed_dict={X: X_test, y: y_test})

        print(epoch,
              "Train accuracy:", acc_train,
              "Test accuracy:", acc_test)


0 Train accuracy: 0.953333 Test accuracy: 0.8711
1 Train accuracy: 0.953333 Test accuracy: 0.9417
2 Train accuracy: 0.953333 Test accuracy: 0.9432
3 Train accuracy: 0.946667 Test accuracy: 0.9595
4 Train accuracy: 0.98 Test accuracy: 0.9627
5 Train accuracy: 0.966667 Test accuracy: 0.9666
6 Train accuracy: 0.96 Test accuracy: 0.961
7 Train accuracy: 0.973333 Test accuracy: 0.9729
8 Train accuracy: 0.986667 Test accuracy: 0.9702
9 Train accuracy: 0.986667 Test accuracy: 0.9732


RNN Training: Predicting Time Series


[Figure: left panel "A time series (generated)", with a training instance highlighted; right panel "A training instance"; x-axis: Time]


t_min, t_max = 0, 30
resolution = 0.1


def time_series(t):
    return t * np.sin(t) / 3 + 2 * np.sin(t * 5)


def next_batch(batch_size, n_steps):
    # random start times, chosen so that each window fits inside [t_min, t_max]
    t0 = np.random.rand(batch_size, 1) * (t_max - t_min - n_steps * resolution)
    Ts = t0 + np.arange(0., n_steps + 1) * resolution
    ys = time_series(Ts)
    # inputs = first n_steps values; targets = same values shifted one step ahead
    return ys[:, :-1].reshape(-1, n_steps, 1), ys[:, 1:].reshape(-1, n_steps, 1)


t = np.linspace(t_min, t_max, int((t_max - t_min) / resolution))


n_steps = 20


t_instance = np.linspace(
    12.2, 12.2 + resolution * (n_steps + 1), n_steps + 1)


# each training instance - 20 inputs long
# targets - 20-input sequences


tf.reset_default_graph()


n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1


X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])


cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons,
    activation=tf.nn.relu)


outputs, states = tf.nn.dynamic_rnn(
    cell, X, dtype=tf.float32)


print(outputs.shape)


(?, 20, 100)


# output at each time step now vector[100],
# but we want single output value at each step.


# use OutputProjectionWrapper()
# -- adds FC layer to top of each output
# note: dynamic_rnn() has to be called again with this wrapped cell;
# otherwise `outputs` still comes from the unwrapped cell (shape (?, 20, 100))


cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(
        num_units=n_neurons,
        activation=tf.nn.relu),
    output_size=n_outputs)


# define cost function using MSE
# use Adam optimizer


learning_rate = 0.001
loss = tf.reduce_mean(
    tf.square(outputs - y))


optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate)


training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()


# initialize & run


init = tf.global_variables_initializer()
n_iterations = 1000
batch_size = 50


with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)

    # use trained model to make some predictions
    X_new = time_series(np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))
    y_pred = sess.run(outputs, feed_dict={X: X_new})


print(y_pred)


0 MSE: 15.3099
100 MSE: 13.5276
200 MSE: 11.0956
300 MSE: 9.91156
400 MSE: 14.0311
500 MSE: 9.73811
600 MSE: 9.23351
700 MSE: 9.64445
800 MSE: 8.98904
900 MSE: 10.849
[Output: y_pred array (truncated)]


import matplotlib.pyplot as plt


plt.title("Testing the model", fontsize=14)


plt.plot(
    t_instance[:-1],
    time_series(t_instance[:-1]),
    "bo", markersize=10, label="instance")


plt.plot(
    t_instance[1:],
    time_series(t_instance[1:]),
    "w*", markersize=10, label="target")


plt.plot(
    t_instance[1:],
    y_pred[0,:,0],
    "r.", markersize=10, label="prediction")


plt.legend(loc="upper left")
plt.xlabel("Time")
#save_fig("time_series_pred_plot")


plt.show()


• OutputProjectionWrapper() - simplest solution for reducing output sequences to one value per time step, but not the most efficient.
• More efficient solution shown below - significant speed boost.


tf.reset_default_graph()

n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons,
    activation=tf.nn.relu)

rnn_outputs, states = tf.nn.dynamic_rnn(
    cell, X, dtype=tf.float32)

# stack outputs using reshape
stacked_rnn_outputs = tf.reshape(
    rnn_outputs, [-1, n_neurons])

print(stacked_rnn_outputs)

# add FC layer -- just a projection, so no activation fn needed
stacked_outputs = fully_connected(
    stacked_rnn_outputs,
    n_outputs,
    activation_fn=None)

print(stacked_outputs)

# unstack outputs using reshape
outputs = tf.reshape(
    stacked_outputs, [-1, n_steps, n_outputs])

print(outputs)


loss = tf.reduce_sum(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

# initialize & run
init = tf.global_variables_initializer()

n_iterations = 1000
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)

    # use trained model to make some predictions
    X_new = time_series(np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))
    y_pred = sess.run(outputs, feed_dict={X: X_new})
    print(y_pred)


Tensor("Reshape:0", shape=(?, 100), dtype=float32)
Tensor("fully_connected/BiasAdd:0", shape=(?, 1), dtype=float32)
Tensor("Reshape_1:0", shape=(?, 20, 1), dtype=float32)
0 MSE: 22963.7
100 MSE: 743.444
800 MSE: 53.4219
900 MSE: 43.2203
. 46527553] 

. 46867704] 

. 10144436] 
.69717044] 
.08823276] 
.13628578] 
.55210543] 
.4186697 | 

. 85978389] 

. 15520501] 
.67705297] 
.6919663 ] 

. 93633199] 

. 70151305] 
.87054777] 
.11770582] 
.15701818] 

. 71814394] 
.69798708] 
.08309698]]] 




plt.title("Testing the model", fontsize=14)
plt.plot(t_instance[:-1], time_series(t_instance[:-1]), "bo", markersize=10, label="instance")
plt.plot(t_instance[1:], time_series(t_instance[1:]), "w*", markersize=10, label="target")
plt.plot(t_instance[1:], y_pred[0,:,0], "r.", markersize=10, label="prediction")
plt.legend(loc="upper left")
plt.xlabel("Time")

plt.show()
Creative RNNs


- Use the model to generate creative sequences.
- Provide a seed sequence of length n_steps (for example, zero-filled).
- Use the model to predict the next value and append it to the sequence.
- Feed the last n_steps values back to the model to predict the next value, and so on.
- The result should be a new sequence resembling the original time series.
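
The generation loop described above can be sketched without TensorFlow. This is a minimal illustration, not the notebook's code: `generate_sequence` and `toy_predictor` are hypothetical names, and the toy predictor (last value plus one) only stands in for a trained RNN so the feedback mechanism is easy to follow.

```python
import numpy as np

def generate_sequence(predict_next, seed, n_new):
    """Autoregressively extend `seed` by `n_new` predicted values."""
    n_steps = len(seed)
    sequence = list(seed)
    for _ in range(n_new):
        # Shape the last n_steps values like a model input: (batch=1, n_steps, n_inputs=1)
        window = np.array(sequence[-n_steps:], dtype=float).reshape(1, n_steps, 1)
        sequence.append(float(predict_next(window)))
    return sequence

def toy_predictor(window):
    # Hypothetical stand-in for the trained RNN: returns the last value plus one.
    return window[0, -1, 0] + 1.0

generated = generate_sequence(toy_predictor, seed=[0.0, 0.0, 0.0], n_new=4)
print(generated)  # [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
```

With a real model, `predict_next` would wrap something like `sess.run(outputs, ...)` and return the prediction for the last time step.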


n_iterations = 2000
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, n_steps)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)

    # seed 1: zero-filled sequence
    sequence1 = [0. for i in range(n_steps)]
    for iteration in range(len(t) - n_steps):
        X_batch = np.array(sequence1[-n_steps:]).reshape(1, n_steps, 1)
        y_pred = sess.run(outputs, feed_dict={X: X_batch})
        sequence1.append(y_pred[0, -1, 0])

    # seed 2: a slice of the real time series
    sequence2 = [time_series(i * resolution + t_min + (t_max-t_min/3)) for i in range(n_steps)]
    for iteration in range(len(t) - n_steps):
        X_batch = np.array(sequence2[-n_steps:]).reshape(1, n_steps, 1)
        y_pred = sess.run(outputs, feed_dict={X: X_batch})
        sequence2.append(y_pred[0, -1, 0])

plt.figure(figsize=(11,4))
plt.subplot(121)
plt.plot(t, sequence1, "b-")
plt.plot(t[:n_steps], sequence1[:n_steps], "b-", linewidth=3)
plt.xlabel("Time")
plt.ylabel("Value")

plt.subplot(122)
plt.plot(t, sequence2, "b-")
plt.plot(t[:n_steps], sequence2[:n_steps], "b-", linewidth=3)
plt.xlabel("Time")
#save_fig("creative_sequence_plot")

plt.show()


0 MSE: 14607.1
100 MSE: 505.605
200 MSE: 167.29
300 MSE: 83.1336
400 MSE: 58.9695
500 MSE: 61.0224
600 MSE: 55.8671
700 MSE: 43.7078
800 MSE: 57.2013
900 MSE: 55.3992
1100 MSE: 55.48
1200 MSE: 39.4618
1300 MSE: 40.7414
1400 MSE: 47.8548
1500 MSE: 43.9252
1600 MSE: 47.892
1700 MSE: 42.0762
1800 MSE: 48.2429
1900 MSE: 42.7509
Deep RNNs


- Built by stacking cells into a MultiRNNCell().


tf.reset_default_graph()

n_inputs = 2
n_neurons = 100
n_layers = 3
n_steps = 5
keep_prob = 0.5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(
    num_units=n_neurons)

print(basic_cell)

multi_layer_cell = tf.contrib.rnn.MultiRNNCell(
    [basic_cell] * n_layers)

print(multi_layer_cell)

# states = tuple (one tensor per layer, = final state of that layer)
outputs, states = tf.nn.dynamic_rnn(
    multi_layer_cell, X, dtype=tf.float32)

init = tf.global_variables_initializer()

import numpy.random as rnd
X_batch = rnd.rand(2, n_steps, n_inputs)

with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states],
        feed_dict={X: X_batch})

print(outputs_val.shape)


<tensorflow.contrib.rnn.python.ops.core_rnn_cell_impl.BasicRNNCell object at 0x7fd1ff3dbb00>
<tensorflow.contrib.rnn.python.ops.core_rnn_cell_impl.MultiRNNCell object at 0x7fd1d9b7c9e8>
(2, 5, 100)


DRNNs: Multiple GPUs 


- TODO


Dropout 


- Very deep RNNs are prone to overfitting; dropout helps prevent it.
- Dropout can be applied before or after the RNN.
- To apply dropout between RNN layers, use DropoutWrapper.
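
To make the role of the keep probability concrete, here is a minimal NumPy sketch of inverted dropout. It is an illustration under stated assumptions, not TensorFlow's implementation: the function name `dropout` and its `training` flag are hypothetical, and the rescaling by `1 / keep_prob` is the usual convention that keeps the expected activation unchanged.

```python
import numpy as np

def dropout(x, keep_prob, training, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob while training."""
    if not training or keep_prob >= 1.0:
        return x  # at test time the activations pass through unchanged
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob  # rescale so the expected value stays the same

rng = np.random.default_rng(0)
activations = np.ones((4, 5))
train_out = dropout(activations, keep_prob=0.5, training=True, rng=rng)
test_out = dropout(activations, keep_prob=0.5, training=False, rng=rng)
print(train_out)  # entries are either 0.0 (dropped) or 2.0 (kept and rescaled)
```

The explicit `training` flag is exactly what the wrapper-based code lacks, which is why the notes build separate graphs for training and testing.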


# apply 50% dropout to inputs of RNN layers
# can apply dropout to outputs via output_keep_prob





tf.reset_default_graph()
from tensorflow.contrib.layers import fully_connected

n_inputs = 1
n_neurons = 100
n_layers = 3
n_steps = 20
n_outputs = 1

keep_prob = 0.5
learning_rate = 0.001


def deep_rnn_with_dropout(X, y, is_training):
    # TF implementation of DropoutWrapper doesn't differentiate
    # between training & testing
    cell = tf.contrib.rnn.BasicRNNCell(
        num_units=n_neurons)

    if is_training:
        cell = tf.contrib.rnn.DropoutWrapper(
            cell, input_keep_prob=keep_prob)

    multi_layer_cell = tf.contrib.rnn.MultiRNNCell(
        [cell] * n_layers)

    rnn_outputs, states = tf.nn.dynamic_rnn(
        multi_layer_cell, X, dtype=tf.float32)

    stacked_rnn_outputs = tf.reshape(
        rnn_outputs, [-1, n_neurons])

    stacked_outputs = fully_connected(
        stacked_rnn_outputs, n_outputs, activation_fn=None)

    outputs = tf.reshape(
        stacked_outputs, [-1, n_steps, n_outputs])

    loss = tf.reduce_sum(
        tf.square(outputs - y))

    optimizer = tf.train.AdamOptimizer(
        learning_rate=learning_rate)

    training_op = optimizer.minimize(loss)

    return outputs, loss, training_op


is_training = True  # set to False to build the testing graph (no dropout)

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
outputs, loss, training_op = deep_rnn_with_dropout(X, y, is_training)

init = tf.global_variables_initializer()
saver = tf.train.Saver()


- DropoutWrapper applies dropout during both training and testing, which is not what we want.
- It has no notion of training vs. testing, so you need one graph for training and another for testing.


n_iterations = 2000
batch_size = 50

is_training = True

with tf.Session() as sess:
    if is_training:
        init.run()
        for iteration in range(n_iterations):
            X_batch, y_batch = next_batch(batch_size, n_steps)
            sess.run(
                training_op,
                feed_dict={X: X_batch, y: y_batch})

            if iteration % 100 == 0:
                mse = loss.eval(
                    feed_dict={X: X_batch, y: y_batch})
                print(iteration, "\tMSE:", mse)
        save_path = saver.save(sess, "/tmp/my_model.ckpt")

    else:
        saver.restore(sess, "/tmp/my_model.ckpt")

        X_new = time_series(
            np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))
        y_pred = sess.run(
            outputs, feed_dict={X: X_new})

        plt.title("Testing the model", fontsize=14)
        plt.plot(t_instance[:-1], time_series(t_instance[:-1]),
                 "bo", markersize=10, label="instance")
        plt.plot(t_instance[1:], time_series(t_instance[1:]),
                 "w*", markersize=10, label="target")
        plt.plot(t_instance[1:], y_pred[0,:,0],
                 "r.", markersize=10, label="prediction")
        plt.legend(loc="upper left")
        plt.xlabel("Time")
        plt.show()
# testing

with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model.ckpt")

    X_new = time_series(
        np.array(t_instance[:-1].reshape(-1, n_steps, n_inputs)))

    y_pred = sess.run(
        outputs, feed_dict={X: X_new})

    plt.title("Testing the model", fontsize=14)
    plt.plot(t_instance[:-1], time_series(t_instance[:-1]),
             "bo", markersize=10, label="instance")
    plt.plot(t_instance[1:], time_series(t_instance[1:]),
             "w*", markersize=10, label="target")
    plt.plot(t_instance[1:], y_pred[0,:,0],
             "r.", markersize=10, label="prediction")
    plt.legend(loc="upper left")
    plt.xlabel("Time")
    plt.show()


Training across Many Time Steps


- Problem #1: RNNs are susceptible to vanishing/exploding gradients. The previous tricks still work, but training becomes prohibitively slow even for moderately long sequences.

- Solution #1: truncated backpropagation through time (unroll the RNN over a limited number of time steps during training). It works, but the model cannot learn long-term patterns.

- Problem #2: the memory of early inputs fades away, because some information is lost at each transformation.

- Solution #2: use a long-term memory cell.
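
One simple way to get the effect of truncated backpropagation through time is to cut a long series into short windows and train on those, so gradients never flow across more than `n_steps` time steps. The sketch below shows only that windowing step; `split_into_windows` is a hypothetical helper, and non-overlapping windows are an assumption made for brevity.

```python
import numpy as np

def split_into_windows(series, n_steps):
    """Cut a long 1-D series into non-overlapping windows of n_steps values."""
    n_windows = len(series) // n_steps
    trimmed = np.asarray(series[: n_windows * n_steps], dtype=float)
    # Result shape matches the RNN input layout: (batch, n_steps, n_inputs=1)
    return trimmed.reshape(n_windows, n_steps, 1)

long_series = np.sin(np.linspace(0, 30, 1005))
windows = split_into_windows(long_series, n_steps=20)
print(windows.shape)  # (50, 20, 1)
```

The last 5 values are dropped because they do not fill a whole window. The cost noted above still applies: no pattern longer than 20 steps can be learned from these windows.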


Long Short-Term Memory (LSTM) Cell 


[Figure: LSTM cell, showing the forget gate, element-wise multiplication and addition nodes, and the logistic and tanh layers]


- Implemented via BasicLSTMCell() instead of BasicRNNCell().

- Key feature: the network learns what to store in the long-term state, what to read from it, and what to throw away.

- Four FC layers, each with its own purpose:
  - main layer: outputs g(t)
  - forget gate: controlled by f(t); decides which parts of the long-term state to erase
  - input gate: controlled by i(t); decides which parts of g(t) to add to the long-term state
  - output gate: controlled by o(t); decides which parts of the long-term state should be read and output at this time step.
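
The four layers above combine into a single time step as follows. This is a minimal NumPy sketch of the standard LSTM equations, not the internals of BasicLSTMCell(): the function name `lstm_step`, the packing of all four layers into one weight matrix, and the column order (i, f, o, g) are assumptions chosen for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step.

    W_x: (n_inputs, 4 * n_neurons), W_h: (n_neurons, 4 * n_neurons), b: (4 * n_neurons,)
    Column blocks are ordered i, f, o, g (an arbitrary choice for this sketch).
    """
    n = h_prev.shape[1]
    z = x @ W_x + h_prev @ W_h + b
    i = sigmoid(z[:, 0 * n:1 * n])   # input gate
    f = sigmoid(z[:, 1 * n:2 * n])   # forget gate
    o = sigmoid(z[:, 2 * n:3 * n])   # output gate
    g = np.tanh(z[:, 3 * n:4 * n])   # main layer
    c = f * c_prev + i * g           # erase part of long-term state, then add new memories
    h = o * np.tanh(c)               # read part of long-term state as short-term state/output
    return h, c

n_inputs, n_neurons = 3, 4
x = np.zeros((2, n_inputs))
h0 = np.zeros((2, n_neurons))
c0 = np.ones((2, n_neurons))
W_x = np.zeros((n_inputs, 4 * n_neurons))
W_h = np.zeros((n_neurons, 4 * n_neurons))
b = np.zeros(4 * n_neurons)
h1, c1 = lstm_step(x, h0, c0, W_x, W_h, b)
print(c1[0])  # all gates are 0.5 and g is 0, so c1 = 0.5 * c0
```

With all-zero weights every gate sits at 0.5, so half of the long-term state is kept, which makes the role of the forget gate easy to check by hand.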


tf.reset_default_graph()

from tensorflow.contrib.layers import fully_connected

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

lstm_cell = tf.contrib.rnn.BasicLSTMCell(
    num_units=n_neurons)

multi_cell = tf.contrib.rnn.MultiRNNCell(
    [lstm_cell]*3)

outputs, states = tf.nn.dynamic_rnn(
    multi_cell, X, dtype=tf.float32)

top_layer_h_state = states[-1][1]

logits = fully_connected(
    top_layer_h_state,
    n_outputs,
    activation_fn=None, scope="softmax")

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=y, logits=logits)

loss = tf.reduce_mean(
    xentropy, name="loss")

optimizer = tf.train.AdamOptimizer(
    learning_rate=learning_rate)

training_op = optimizer.minimize(loss)

correct = tf.nn.in_top_k(
    logits, y, 1)

accuracy = tf.reduce_mean(
    tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()


states 


(LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_2:0' shape=(?, 150) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_3:0' shape=(?, 150) dtype=float32>),
 LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_4:0' shape=(?, 150) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_5:0' shape=(?, 150) dtype=float32>),
 LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_6:0' shape=(?, 150) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_7:0' shape=(?, 150) dtype=float32>))


top_layer_h_state 


<tf.Tensor 'rnn/while/Exit_7:0' shape=(?, 150) dtype=float32> 


n_epochs = 10
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((batch_size, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})

        print("Epoch", epoch, "Train accuracy =", acc_train, "Test accuracy =", acc_test)


Epoch 0 Train accuracy = 0.966667 Test accuracy = 0.9403 
Epoch 1 Train accuracy = 0.98 Test accuracy = 0.9742 
Epoch 2 Train accuracy = 0.993333 Test accuracy = 0.979 
Epoch 3 Train accuracy = 0.993333 Test accuracy = 0.9805 
Epoch 4 Train accuracy = 1.0 Test accuracy = 0.9854 
Epoch 5 Train accuracy = 0.98 Test accuracy = 0.9827 
Epoch 6 Train accuracy = 0.993333 Test accuracy = 0.9851 
Epoch 7 Train accuracy = 1.0 Test accuracy = 0.9865 
Epoch 8 Train accuracy = 1.0 Test accuracy = 0.9887 
Epoch 9 Train accuracy = 0.993333 Test accuracy = 0.9871 


Peephole Connections 


- Basic LSTM cell: the gate controllers only see the input x(t) and the previous short-term state h(t-1).
- Improvement: let the gate controllers peek at the long-term state too. The previous long-term state c(t-1) is added as an input to the forget gate and input gate controllers; the current long-term state c(t) is added as an input to the output gate controller.


# Peepholes in TF

lstm_cell = tf.contrib.rnn.LSTMCell(
    num_units=n_neurons,
    use_peepholes=True)


Gated Recurrent Unit (GRU) Cell 


- Simplified version of the LSTM cell.

- Both state vectors are merged into a single vector h(t).

- A single gate controller manages both the forget gate & input gate (whenever a memory is to be stored, the location where it will be stored is erased first).

- No output gate: the full state vector is output at every time step.
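
A single GRU step can be sketched in NumPy as follows. This is an illustration, not the code behind GRUCell(): the name `gru_step`, the parameter dictionary `p`, and the gate convention (the shared gate z keeps the old state, 1 - z admits the new candidate) are assumptions; some references swap the roles of z and 1 - z.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU time step; `p` is a dict of weight matrices and biases (hypothetical layout)."""
    z = sigmoid(x @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])  # shared forget/input gate
    r = sigmoid(x @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])  # which parts of h_prev the main layer sees
    g = np.tanh(x @ p["W_xg"] + (r * h_prev) @ p["W_hg"] + p["b_g"])  # candidate state
    # Where z is close to 1 the old memory is kept; where it is close to 0 it is overwritten.
    return z * h_prev + (1.0 - z) * g

n_inputs, n_neurons = 3, 4
p = {}
for gate in ("z", "r", "g"):
    p["W_x" + gate] = np.zeros((n_inputs, n_neurons))
    p["W_h" + gate] = np.zeros((n_neurons, n_neurons))
    p["b_" + gate] = np.zeros(n_neurons)

h_prev = np.ones((2, n_neurons))
h_next = gru_step(np.zeros((2, n_inputs)), h_prev, p)
print(h_next[0])  # z = 0.5 and g = 0, so h_next = 0.5 * h_prev
```

Because one gate value controls both erasing and writing, a location can only take a new memory to the extent that the old one is removed, which is the coupling described in the notes.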





# in TF
gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)


Natural Language Processing (NLP) 


- Mostly based on RNNs.
- See the Word2Vec and Seq2Seq tutorials.
- More: Chris Olah, Sebastian Ruder.


Word Embeddings 


- First, we need a word representation: similar words should have similar representations.
- Common solution: represent each word in the vocabulary by a small, dense vector (an embedding).


# create embeddings variable, initialized with random values in [-1, 1]
vocabulary_size = 50000
embedding_size = 150
embeddings = tf.Variable(
    tf.random_uniform(
        [vocabulary_size, embedding_size],
        -1.0, 1.0))


- Feeding new sentences to the net: replace unknown words, numbers, URLs, etc. with predefined tokens. Each known word is then looked up in a dictionary to get its integer ID.


train_inputs = tf.placeholder(
    tf.int32, shape=[None])

embed = tf.nn.embedding_lookup(
    embeddings, train_inputs)  # ids to embeddings
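
The lookup pipeline (word to ID, ID to embedding row) can be sketched in plain NumPy. The toy vocabulary, the `embed_sentence` helper, and the choice of ID 0 for the unknown-word token are assumptions made for illustration; the random matrix only stands in for trained embeddings.

```python
import numpy as np

vocabulary = ["UNK", "the", "cat", "sat"]  # hypothetical toy vocabulary; ID 0 = unknown word
word_to_id = {word: i for i, word in enumerate(vocabulary)}

rng = np.random.default_rng(42)
embedding_size = 3
# Random values in [-1, 1), standing in for a trained embedding matrix
embeddings = rng.uniform(-1.0, 1.0, size=(len(vocabulary), embedding_size))

def embed_sentence(words):
    """Map words to IDs (unknown words become ID 0), then IDs to embedding rows."""
    ids = np.array([word_to_id.get(word, 0) for word in words])
    return ids, embeddings[ids]  # row indexing does the same job as an embedding lookup

ids, vectors = embed_sentence(["the", "dog", "sat"])
print(ids)            # [1 0 3]
print(vectors.shape)  # (3, 3)
```

"dog" is not in the toy vocabulary, so it is mapped to the UNK row.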


English => French Encoder-Decoder Network


- English sentences are the encoder inputs; French translations are the decoder outputs.

- The French translations are also fed to the decoder as inputs, pushed back by one step.

- English sentences are reversed before being fed in (so the beginning of the sentence is fed last, which is the first thing the decoder needs to translate).

- The decoder returns a score for each word in the output vocabulary; softmax turns the scores into probabilities, and the highest-probability word is returned.
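
The last two points can be illustrated in a few lines of NumPy. The vocabulary, the scores, and the example sentence below are made up for illustration; a real decoder would produce one score vector per output time step.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical output vocabulary and decoder scores for one time step
output_vocab = ["<eos>", "je", "bois", "du", "lait"]
scores = np.array([0.1, 2.0, 0.3, -1.0, 0.5])

probs = softmax(scores)
best_word = output_vocab[int(np.argmax(probs))]
print(best_word)  # je

# Reversing the English input so that the first word is fed to the encoder last
english = ["I", "drink", "milk"]
encoder_input = english[::-1]
print(encoder_input)  # ['milk', 'drink', 'I']
```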
from six.moves import urllib

import errno
import os
import zipfile

WORDS_PATH = "datasets/words"
WORDS_URL = 'http://mattmahoney.net/dc/text8.zip'

def mkdir_p(path):
    """Create directories, ok if they already exist.

    This is for python 2 support. In python >=3.2, simply use:
    >>> os.makedirs(path, exist_ok=True)
    """
    try:
        os.makedirs(path)
    except OSError as exc:
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise

def fetch_words_data(words_url=WORDS_URL, words_path=WORDS_PATH):
    os.makedirs(words_path, exist_ok=True)
    zip_path = os.path.join(words_path, "words.zip")
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(words_url, zip_path)
    with zipfile.ZipFile(zip_path) as f:
        data = f.read(f.namelist()[0])
    return data.decode("ascii").split()


words = fetch_words_data()
words[:5]


['anarchism', 'originated', 'as', 'a', 'term'] 


Build dictionary 


from collections import Counter

vocabulary_size = 50000
vocabulary = [("UNK", None)] + Counter(words).most_common(vocabulary_size - 1)
vocabulary = np.array([word for word, _ in vocabulary])
dictionary = {word: code for code, word in enumerate(vocabulary)}
data = np.array([dictionary.get(word, 0) for word in words])


" ".join(words[:9]), data[:9]


('anarchism originated as a term of abuse first used',
 array([5244, 3081, 12, 6, 195, 2 13195 46, 59]))


" ".join([vocabulary[word_index] for word_index in [5241, 3081, 12, 6, 195, 2, 3134, 46, 59]])


'anywhere originated as a term of presidency first used' 


words[24], data[24] 


('culottes', 0) 


Generate batches 


import random
from collections import deque

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels


data_index=0 
batch, labels = generate_batch(8, 2, 1) 


batch, [vocabulary[word] for word in batch] 


(array([3081, 3081,   12,   12,    6,    6,  195,  195], dtype=int32),
 ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term'])


labels, [vocabulary[word] for word in labels[:, 0]]


(array([[5244],
        [  12],
        [   6],
        [3081],
        [ 195],
        [  12],
        [   6],
        [   2]], dtype=int32),
 ['anarchism', 'as', 'a', 'originated', 'term', 'as', 'a', 'of'])


Build the Model 


batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = rnd.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

learning_rate = 0.01

tf.reset_default_graph()

# Input data.
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Look up embeddings for inputs.
init_embeddings = tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
embeddings = tf.Variable(init_embeddings)
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Compute the average NCE loss for the batch.
# tf.nn.nce_loss automatically draws a new sample of the negative labels each
# time we evaluate the loss.
loss = tf.reduce_mean(
    tf.nn.nce_loss(nce_weights, nce_biases, train_labels, embed,
                   num_sampled, vocabulary_size))

# Construct the Adam optimizer
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), axis=1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

# Add variable initializer.
init = tf.global_variables_initializer()
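
The cosine-similarity step at the end of the graph can be checked on a tiny example. This NumPy sketch mirrors the normalize-then-matmul pattern; the `nearest_neighbors` helper and the 2-D toy embeddings are assumptions made for illustration.

```python
import numpy as np

def nearest_neighbors(embeddings, query_ids, top_k):
    """Return the IDs of the top_k most cosine-similar rows for each query row."""
    norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norm
    # Dot products of unit vectors are cosine similarities
    similarity = normalized[query_ids] @ normalized.T
    # Sort by decreasing similarity and skip position 0 (each word matches itself best)
    return np.argsort(-similarity, axis=1)[:, 1:top_k + 1]

toy_embeddings = np.array([[1.0, 0.0],
                           [0.9, 0.1],
                           [0.0, 1.0],
                           [-1.0, 0.0]])
neighbors = nearest_neighbors(toy_embeddings, query_ids=[0], top_k=2)
print(neighbors)  # [[1 2]]
```

Row 1 points in almost the same direction as row 0, so it ranks first; row 3 points the opposite way and ranks last.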


num_steps = 1000  # was 100000?

with tf.Session() as session:
    init.run()

    average_loss = 0
    for step in range(num_steps):
        print("\rIteration: {}".format(step), end="\t")
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the training op (including it
        # in the list of returned values for session.run())
        _, loss_val = session.run([training_op, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = vocabulary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log_str = "Nearest to %s:" % valid_word
                for k in range(top_k):
                    close_word = vocabulary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)

    final_embeddings = normalized_embeddings.eval()


Iteration: 0 Average loss at step 0 : 260.603485107
Nearest to and: marsh, sipe, vehement, exercises, einer, mrnas, dancer, grendel,
Nearest to called: innuendo, algerian, synthesizing, montgomery, unspoken, elevating, plankton, monochromatic,
Nearest to many: salinas, fuji, trochaic, rubinstein, eln, tintin, lloyd, carbides,
Nearest to about: moreover, congo, choctaws, accomplished, unwieldy, ks, halifax, pac,
Nearest to than: awake, exact, offutt, gloster, pronunciations, delight, tsarina, hopped,
Nearest to or: long, mage, warriors, adhering, sk, clitoridectomy, parenting, vanguard,
Nearest to of: shakespeare, kemp, relax, cul, breakaway, solemnly, mason, mng,
Nearest to when: tolstoy, courtesan, hashes, coursing, evi, ren, diurnal, stimson,
Nearest to four: supermassive, soviet, palatalization, acclaimed, aided, whitney, filtration, lesbians,
Nearest to most: din, hawaii, loch, necronomicon, sunnah, sh, onager, miracles,
Nearest to on: helpers, tangle, heretical, compulsion, unorganized, rump, intimidating, israeli,
Nearest to but: ohio, rican, politeness, watkins, ingesting, street, hatred, novices,
Nearest to that: xhosa, distressed, continually, fausto, iole, admitted, etsi, gross,
Nearest to all: orissa, persistent, moro, informative, reservation, ren, browne, frobenius,
Nearest to in: chanced, accelerator, sergio, demonstrating, inertia, jarrett, intricate, orange,
Nearest to had: irredentist, kbit, sarris, lactate, bettor, narratives, hui, transpired,
Iteration: 999


Save final embeddings 


np.save("my_final_embeddings.npy", final_embeddings)


Plot embeddings 


def plot_with_labels(low_dim_embs, labels):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i,:]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()


from sklearn.manifold import TSNE

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 500
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
labels = [vocabulary[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
import tensorflow as tf
import sys


Data Representations 


- It is much easier to remember patterns in a sequence than to memorize the exact list. This was first studied with chess game positions (1970s).

- An autoencoder converts its inputs to an internal shorthand, then outputs its best-guess reconstruction of those inputs. Two parts: encoder (recognizer) & decoder (generator, aka reconstructor).

- Reconstruction loss: penalizes the model when the reconstructions differ from the inputs.

- The internal representation has a lower dimensionality than the inputs, so the AE is forced to learn the most important features in the inputs.
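
To see what "lower-dimensional representation plus reconstruction loss" means numerically, the sketch below encodes 3D points into 2D codings and measures the mean squared reconstruction error. It uses the closed-form PCA projection (via SVD) as a stand-in for a trained linear encoder/decoder pair; the synthetic data and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3D data that lies close to a 2D plane
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 3))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 3))
X = X - X.mean(axis=0)  # center the data

# Rows of Vt are the principal directions; the top two span the best 2D subspace
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_enc = Vt[:2].T                  # encoder weights, shape (3, 2)

codings = X @ W_enc               # internal representation, shape (200, 2)
reconstructions = codings @ W_enc.T
reconstruction_loss = np.mean(np.square(reconstructions - X))

print(codings.shape)
print(reconstruction_loss)        # small, because little information is lost
```

The loss is close to zero here only because the data was built to be nearly two-dimensional; for data that fills all three dimensions, squeezing it through two codings would lose more.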


PCA with Undercomplete Linear Autoencoder 


ch15 autoencoders.md 


# let's build a 3D dataset

rnd.seed(4)
m = 100
w1, w2 = 0.1, 0.3
noise = 0.1

angles = rnd.rand(m) * 3 * np.pi / 2 - 0.5
X_train = np.empty((m, 3))
X_train[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * rnd.randn(m) / 2
X_train[:, 1] = np.sin(angles) * 0.7 + noise * rnd.randn(m) / 2
X_train[:, 2] = X_train[:, 0] * w1 + X_train[:, 1] * w2 + noise * rnd.randn(m)

# normalize it

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

plt.plot(X_train)
plt.show()
# build AE
from tensorflow.contrib.layers import fully_connected

n_inputs = 3   # 3D inputs
n_hidden = 2   # 2D codings
n_outputs = n_inputs

learning_rate = 0.01

X = tf.placeholder(
    tf.float32, shape=[None, n_inputs])

#
# set activation_fn=None & use MSE for cost function
# to perform simple PCA.
#

hidden = fully_connected(
    X,
    n_hidden,
    activation_fn=None)

outputs = fully_connected(
    hidden,
    n_outputs,
    activation_fn=None)

# MSE
reconstruction_loss = tf.reduce_mean(
    tf.square(outputs - X))

optimizer = tf.train.AdamOptimizer(
    learning_rate)

training_op = optimizer.minimize(
    reconstruction_loss)

init = tf.global_variables_initializer()


# run the AE

n_iterations = 10000
codings = hidden

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        training_op.run(feed_dict={X: X_train})
    codings_val = codings.eval(feed_dict={X: X_train})


fig = plt.figure(figsize=(4,3))
plt.plot(codings_val[:,0], codings_val[:, 1], "b.")
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18, rotation=0)
#save_fig("linear_autoencoder_pca_plot")
plt.show()


# plot: 2D projection with max variance
Stacked Autoencoders


- AEs with multiple hidden layers, which lets them learn more complex codings.


[Figure: stacked autoencoder for MNIST, from 784 input units to 784 output units (reconstructions = inputs)]


tf.reset_default_graph()

n_inputs = 28 * 28  # for MNIST
n_hidden1 = 300
n_hidden2 = 150  # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs

learning_rate = 0.01
l2_reg = 0.0001

X = tf.placeholder(tf.float32,
                   shape=[None, n_inputs])

with tf.contrib.framework.arg_scope(
        [fully_connected],
        activation_fn=tf.nn.elu,
        weights_initializer=tf.contrib.layers.variance_scaling_initializer(),
        weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg)):
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2)  # codings
    hidden3 = fully_connected(hidden2, n_hidden3)
    outputs = fully_connected(hidden3, n_outputs, activation_fn=None)

# MSE
reconstruction_loss = tf.reduce_mean(
    tf.square(outputs - X))

reg_losses = tf.get_collection(
    tf.GraphKeys.REGULARIZATION_LOSSES)

loss = tf.add_n(
    [reconstruction_loss] + reg_losses)

optimizer = tf.train.AdamOptimizer(
    learning_rate)

training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()


# use MNIST dataset

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

# train the net. digit labels (y_batch) are unused.

n_epochs = 4
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            print("\r{}%".format(100 * iteration // n_batches), end="")
            sys.stdout.flush()
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})
        mse_train = reconstruction_loss.eval(feed_dict={X: X_batch})
        print("\r{}".format(epoch), "Train MSE:", mse_train)
        saver.save(sess, "./my_model_all_layers.ckpt")


Extracting /tmp/data/train-images-idx3-ubyte.gz 
Extracting /tmp/data/train-labels-idx1-ubyte.gz 
Extracting /tmp/data/t10k-images-idx3-ubyte.gz 
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz 
0 Train MSE: 0.02705
1 Train MSE: 0.0137857
2 Train MSE: 0.0113694
3 Train MSE: 0.0107478


# Utility: plot grayscale 28x28 image

def plot_image(image, shape=[28, 28]):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")


# load model, eval on test set (measure reconstruction error, display original & reconstruction)

def show_reconstructed_digits(X, outputs, model_path = None, n_test_digits = 2):
    with tf.Session() as sess:
        if model_path:
            saver.restore(sess, model_path)
        X_test = mnist.test.images[:n_test_digits]
        outputs_val = outputs.eval(feed_dict={X: X_test})

    fig = plt.figure(figsize=(8, 3 * n_test_digits))
    for digit_index in range(n_test_digits):
        plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
        plot_image(X_test[digit_index])
        plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
        plot_image(outputs_val[digit_index])

show_reconstructed_digits(X, outputs, "./my_model_all_layers.ckpt")
plt.show()
- Tying weights is used when the AE is symmetrical. Tying the decoder layers' weights to the encoder layers' weights cuts the number of weights by 50% (faster training & less memory).
- Tying weights with fully_connected() in TF is cumbersome; it is easier to define the layers manually.
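
The 50% saving is easy to verify by counting parameters. This NumPy sketch uses the layer sizes from the stacked autoencoder above (784, 300, 150); the random initial values and the variable names `tied_count` and `untied_count` are assumptions for illustration, and biases are left out of the count.

```python
import numpy as np

n_inputs, n_hidden1, n_hidden2 = 784, 300, 150
rng = np.random.default_rng(0)

# Encoder weights are the only free weight matrices
weights1 = rng.normal(scale=0.01, size=(n_inputs, n_hidden1))
weights2 = rng.normal(scale=0.01, size=(n_hidden1, n_hidden2))

# Decoder weights are transposes of the encoder weights (views, not copies)
weights3 = weights2.T
weights4 = weights1.T

tied_count = weights1.size + weights2.size
untied_count = weights1.size + weights2.size + weights3.size + weights4.size

print(tied_count, untied_count)  # 280200 560400
```

Because `weights3` and `weights4` are views of the encoder matrices, any update to the encoder is automatically reflected in the decoder.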


tf.reset_default_graph()

activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])

weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")

# weights 3,4 not vars!
weights3 = tf.transpose(weights2, name="weights3")  # tied weights
weights4 = tf.transpose(weights1, name="weights4")  # tied weights

biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4

reconstruction_loss = tf.reduce_mean(
    tf.square(outputs - X))

reg_loss = regularizer(weights1) + regularizer(weights2)
loss = reconstruction_loss + reg_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

Training one Autoencoder at a time 


• Often faster to train each shallow AE individually, then stack them.
• Simplest approach = use a separate TF graph for each phase.


[Figure: training one autoencoder at a time. Phase 1: train the first autoencoder (Hidden 1). Phase 2: train the second autoencoder (Hidden 2). Phase 3: stack the autoencoders (copy parameters).]


def train_autoencoder(
        X_train,
        n_neurons,
        n_epochs,
        batch_size,
        learning_rate=0.01,
        l2_reg=0.0005,
        activation_fn=tf.nn.elu):

    graph = tf.Graph()
    with graph.as_default():
        n_inputs = X_train.shape[1]

        X = tf.placeholder(tf.float32, shape=[None, n_inputs])

        with tf.contrib.framework.arg_scope(
                [fully_connected],
                activation_fn=activation_fn,
                weights_initializer=tf.contrib.layers.variance_scaling_initializer(),
                weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg)):
            hidden = fully_connected(
                X, n_neurons, scope="hidden")
            outputs = fully_connected(
                hidden, n_inputs, activation_fn=None, scope="outputs")

        mse = tf.reduce_mean(tf.square(outputs - X))

        reg_losses = tf.get_collection(
            tf.GraphKeys.REGULARIZATION_LOSSES)

        loss = tf.add_n([mse] + reg_losses)

        optimizer = tf.train.AdamOptimizer(learning_rate)
        training_op = optimizer.minimize(loss)

        init = tf.global_variables_initializer()

    with tf.Session(graph=graph) as sess:
        init.run()

        for epoch in range(n_epochs):
            n_batches = len(X_train) // batch_size

            for iteration in range(n_batches):
                print("\r{}%".format(100 * iteration // n_batches), end="")
                sys.stdout.flush()

                indices = rnd.permutation(
                    len(X_train))[:batch_size]
                X_batch = X_train[indices]

                sess.run(
                    training_op, feed_dict={X: X_batch})

            mse_train = mse.eval(
                feed_dict={X: X_batch})

            print("\r{}".format(epoch), "Train MSE:", mse_train)

        params = dict(
            [(var.name, var.eval()) for var in tf.get_collection(
                tf.GraphKeys.TRAINABLE_VARIABLES)])

        hidden_val = hidden.eval(
            feed_dict={X: X_train})

        return (hidden_val,
                params["hidden/weights:0"], params["hidden/biases:0"],
                params["outputs/weights:0"], params["outputs/biases:0"])





# train two AEs

hidden_output, W1, b1, W4, b4 = train_autoencoder(
    mnist.train.images,
    n_neurons=300,
    n_epochs=4,
    batch_size=150)

_, W2, b2, W3, b3 = train_autoencoder(
    hidden_output,
    n_neurons=150,
    n_epochs=4,
    batch_size=150)


0 Train MSE: 0.0193591
1 Train MSE: 0.0190697 
2 Train MSE: 0.0188801 
3 Train MSE: 0.0192353 
0 Train MSE: 0.00428287 
1 Train MSE: 0.00438113 
2 Train MSE: 0.00464872 
3 Train MSE: 0.00457076 




tf.reset_default_graph()
n_inputs = 28 * 28


X = tf.placeholder(tf.float32, shape=[None, n_inputs]) 
hidden1 = tf.nn.elu(tf.matmul(X, W1) + b1) 

hidden2 = tf.nn.elu(tf.matmul(hidden1, W2) + b2) 
hidden3 = tf.nn.elu(tf.matmul(hidden2, W3) + b3) 
outputs = tf.matmul(hidden3, W4) + b4 


Visualizing Reconstructions 


Reusing the weights and biases from above.


# Load model, evaluate it on test set (reconstruction error)
# display original & reconstructed images

def show_reconstructed_digits(
        X,
        outputs,
        model_path=None,
        n_test_digits=2):

    with tf.Session() as sess:
        if model_path:
            saver.restore(sess, model_path)

        X_test = mnist.test.images[:n_test_digits]
        outputs_val = outputs.eval(feed_dict={X: X_test})

    fig = plt.figure(figsize=(8, 3 * n_test_digits))

    for digit_index in range(n_test_digits):
        plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
        plot_image(X_test[digit_index])
        plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
        plot_image(outputs_val[digit_index])

    plt.show()


# show_reconstructed_digits(X, outputs, "./my_model_all_layers.ckpt")
show_reconstructed_digits(X, outputs)


Visualizing Features


• Simplest method: find the training instances that activate each hidden node the most (works best on upper layers, given their tendency to capture high-level features).
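A minimal sketch of that method in plain Python. The function name top_activating_instances and the toy activation values are made up for illustration; in practice the activation matrix would come from evaluating a hidden layer on the training set.

```python
def top_activating_instances(activations, k=3):
    """For each hidden unit, return the indices of the k instances
    with the largest activation.

    activations: list of rows, one per instance; each row holds one
    activation value per hidden unit.
    """
    n_units = len(activations[0])
    top = []
    for unit in range(n_units):
        ranked = sorted(
            range(len(activations)),
            key=lambda i: activations[i][unit],
            reverse=True)
        top.append(ranked[:k])
    return top


# Toy activations: 4 instances x 2 hidden units (made-up numbers).
toy = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.5, 0.7],
    [0.9, 0.0],
]
print(top_activating_instances(toy, k=2))  # [[3, 1], [0, 2]]
```

Plotting the selected instances side by side is then enough to see what each unit responds to.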


Unsupervised Pretraining with Stacked 
Autoencoders 


Denoising Autoencoders 
Sparse Autoencoders 
Variational Autoencoders 


Other Autoencoders 


Intro & Resources 


• Sutton/Barto ebook; Silver online course


Learning to Optimize Rewards 


• Definitions: a software agent makes observations & takes actions within an environment. In return it receives rewards (positive or negative).


Policy Search 


• Policy: the algorithm used by an agent to determine its next action.


OpenAI Gym


• A toolkit providing various simulated environments.


!pip3 install --upgrade gym 


Requirement already up-to-date: gym in /home/bjpcjp/anaconda3/lib/python3.5/site-packages
Requirement already up-to-date: requests>=2.0 in /home/bjpcjp/anaconda3/lib/python3.5/site-packages (from gym)
Requirement already up-to-date: pyglet>=1.2.0 in /home/bjpcjp/anaconda3/lib/python3.5/site-packages (from gym)
Requirement already up-to-date: six in /home/bjpcjp/anaconda3/lib/python3.5/site-packages (from gym)
Requirement already up-to-date: numpy>=1.10.4 in /home/bjpcjp/anaconda3/lib/python3.5/site-packages (from gym)


import gym 

env = gym.make("CartPole-v0")
obs = env.reset() 

obs 

env.render() 


[2017-04-27 13:05:47,311] Making new env: CartPole-v0 


• make() creates an environment.
• reset() initializes it and returns the first observation.
• CartPole: each observation = 1D NumPy array (horizontal position, velocity, angle, angular velocity).







img = env.render(mode="rgb_array")
img.shape 


(1, 1, 3) 


# what actions are possible? 
# in this case: 0 = accelerate left, 1 = accelerate right 
env.action_space 


Discrete(2) 


# pole is leaning right. let's go further to the right.
action = 1 

obs, reward, done, info = env.step(action) 

obs, reward, done, info 


(array([-0.04061536,  0.1486962 , -0.01966318, -0.29249162]), 1.0, False, {})


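new observation:
◦ hpos = obs[0] < 0 = cart is left of center
◦ velocity = obs[1] > 0 = moving to the right
◦ angle = obs[2]: positive = leaning right, negative = leaning left (here -0.0197, so leaning slightly left)
◦ ang velocity = obs[3] < 0 = pole rotating toward the left


reward = 1.0


done = False (episode not over)


info = {} (empty)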


Basic policy: (1) accelerate left when the pole leans left; (2) accelerate right when it leans right.


def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1


totals = []

for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(1000):  # 1000 steps max, we don't want to run forever
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)


import numpy as np 
np.mean(totals), np.std(totals), np.min(totals), np.max(totals) 


(41.579999999999998, 8.5249985337242151, 25.0, 62.0) 


NN Policies 


• Observations are the inputs; the outputs are action probabilities, and the action to execute is picked at random according to p(action).
• This approach lets the agent find the right balance between exploring new actions & reusing known good actions.
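A minimal sketch of sampling an action from the policy's output probability, using only the standard library. The name sample_action and the value 0.7 are illustrative assumptions, not part of the notes' network.

```python
import random


def sample_action(p_left, rng=random):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1


rng = random.Random(42)  # seeded only to make the demo repeatable
actions = [sample_action(0.7, rng) for _ in range(10000)]
print("fraction of left moves:", actions.count(0) / len(actions))
```

Because the less likely action is still chosen some of the time, the agent keeps exploring even when the network already favors one move.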


Evaluating Actions: Credit Assignment problem 


Reinforcement Learning (RL) training not like supervised learning. 

RL feedback is via rewards (often sparse & delayed) 

How to determine which previous steps were "good" or "bad"? (aka the "credit assignment problem")

Common tactic: score each action by the sum of the rewards that come after it, applying a discount rate at each later step.


Normalize the action scores across many episodes to make them more reliable.


NN Policy Discounts & Rewards 
[Figure: discounted rewards. Actions: Right, Right, Right. Rewards: +10, 0, -50. Discount rate: 80%. Sums of discounted rewards: -22, -40, -50.]




import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

# 1. Specify the neural network architecture
n_inputs = 4   # == env.observation_space.shape[0]
n_hidden = 4   # simple task, don't need more hidden neurons
n_outputs = 1  # only output prob(accelerating left)
initializer = tf.contrib.layers.variance_scaling_initializer()

# 2. Build the neural network
X = tf.placeholder(
    tf.float32, shape=[None, n_inputs])

hidden = fully_connected(
    X, n_hidden,
    activation_fn=tf.nn.elu,
    weights_initializer=initializer)

logits = fully_connected(
    hidden, n_outputs,
    activation_fn=None,
    weights_initializer=initializer)

outputs = tf.nn.sigmoid(logits)  # logistic (sigmoid) ==> returns a value between 0.0 and 1.0

# 3. Select a random action based on the estimated probabilities
p_left_and_right = tf.concat(
    axis=1, values=[outputs, 1 - outputs])

action = tf.multinomial(
    tf.log(p_left_and_right),
    num_samples=1)

init = tf.global_variables_initializer()


Policy Gradient (PG) algorithms 


• Example: the REINFORCE algorithm (1992)


Markov Decision processes (MDPs) 


• Markov chains = stochastic processes with no memory, a fixed number of states, and random transitions between them.
• Markov decision processes = similar to Markov chains, but the agent can choose an action at each step; transition probabilities depend on the chosen action; transitions can return a reward (positive or negative).
• Goal: find the policy that maximizes rewards over time.


[Figure: Markov Chain vs. Markov Decision Process.]


• Bellman Optimality Equation: a recursive equation used to estimate the optimal state value of any state s.
• Knowing the optimal state values is useful, but doesn't tell the agent what to do. The Q-Value Iteration algorithm solves this problem. Optimal Q-Value of a state-action pair (s, a) = sum of discounted future rewards the agent can expect on average after choosing action a in state s, assuming it acts optimally afterwards.
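For reference, the two recursions behind these bullets, written with the same quantities as the code in these notes: T for transition probabilities, R for rewards, and \gamma for the discount rate. The symbols V^*, Q_k and \pi^* are notation assumed here, not taken from the notes.

```latex
% Bellman optimality equation for the optimal state value
V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right]

% Q-Value Iteration update (all Q_0 values start at 0 for possible actions)
Q_{k+1}(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]

% Optimal policy once the Q-Values have converged
\pi^*(s) = \operatorname*{argmax}_a Q^*(s, a)
```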


# Define MDP:

nan = np.nan  # represents impossible actions

T = np.array([  # shape=[s, a, s']
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[nan, nan, nan], [0.8, 0.1, 0.1], [nan, nan, nan]],
])

R = np.array([  # shape=[s, a, s']
    [[10., 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[10., 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.]],
    [[nan, nan, nan], [40., 0.0, 0.0], [nan, nan, nan]],
])

possible_actions = [[0, 1, 2], [0, 2], [1]]


# run Q-Value Iteration algo

Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0  # Initial value = 0.0, for all possible actions

learning_rate = 0.01
discount_rate = 0.95
n_iterations = 100

for iteration in range(n_iterations):
    Q_prev = Q.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q[s, a] = np.sum([
                T[s, a, sp] * (R[s, a, sp] + discount_rate * np.max(Q_prev[sp]))
                for sp in range(3)
            ])

print("Q: \n", Q)
print("Optimal action for each state:\n", np.argmax(Q, axis=1))


Q:
 [[ 21.88646117  20.79149867  16.854807  ]
 [  1.10804034         -inf   1.16703135]
 [        -inf  53.8607061          -inf]]


Optimal action for each state:
 [0 2 1]


discount_rate = 0.90

for iteration in range(n_iterations):
    Q_prev = Q.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q[s, a] = np.sum([
                T[s, a, sp] * (R[s, a, sp] + discount_rate * np.max(Q_prev[sp]))
                for sp in range(3)
            ])

print("Q: \n", Q)
print("Optimal action for each state:\n", np.argmax(Q, axis=1))


Q:
 [[  1.89189499e+01   1.70270580e+01   1.36216526e+01]
 [  3.09979853e-05             -inf  -4.87968388e+00]
 [            -inf   5.01336811e+01             -inf]]
Optimal action for each state:
 [0 0 1]


Temporal Difference Learning & Q-Learning 


• In general, the agent initially has no knowledge of the transition probabilities or the rewards.
• Temporal Difference Learning (TD Learning) is similar to Value Iteration, but accounts for this lack of knowledge.
• The algorithm tracks a running average of the most recent rewards & anticipated rewards.
• The Q-Learning algorithm is an adaptation of Q-Value Iteration to the case where the transition probabilities & rewards are initially unknown.
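The "running average" can be written out explicitly. This is the standard form of the two updates, with \alpha as the learning rate, r the observed reward, s' the observed next state, and \gamma the discount rate; the symbols are notation assumed here.

```latex
% TD Learning update for state values
V(s) \leftarrow (1 - \alpha) \, V(s) + \alpha \left[ r + \gamma \, V(s') \right]

% Q-Learning update for state-action values
Q(s, a) \leftarrow (1 - \alpha) \, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]
```

With a small \alpha the old estimate dominates, so each new sample only nudges the average.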


import numpy.random as rnd

learning_rate0 = 0.05
learning_rate_decay = 0.1
n_iterations = 20000

s = 0  # start in state 0
Q = np.full((3, 3), -np.inf)  # -inf for impossible actions

for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0  # Initial value = 0.0, for all possible actions

for iteration in range(n_iterations):
    a = rnd.choice(possible_actions[s])   # choose an action (randomly)
    sp = rnd.choice(range(3), p=T[s, a])  # pick next state using T[s, a]
    reward = R[s, a, sp]

    learning_rate = learning_rate0 / (1 + iteration * learning_rate_decay)

    # running average: keep (1 - learning_rate) of the old estimate,
    # move learning_rate of the way toward the new target
    Q[s, a] = ((1 - learning_rate) * Q[s, a]
               + learning_rate * (reward + discount_rate * np.max(Q[sp])))

    s = sp  # move to next state

print("Q: \n", Q)
print("Optimal action for each state:\n", np.argmax(Q, axis=1))


Q: 
 [[            -inf   2.47032823e-323              -inf]
 [  0.00000000e+000              -inf   0.00000000e+000]
 [            -inf   0.00000000e+000              -inf]]


Optimal action for each state: 
[1 0 1] 


Exploration Policies 


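• Q-Learning works only if exploration is thorough, which is not always possible.
• Better alternative: an ε-greedy policy. Act randomly with probability ε, otherwise take the action with the highest Q-Value, so the agent spends more time on the interesting routes.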
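A minimal sketch of an ε-greedy choice for a tabular Q, in plain Python. The function name epsilon_greedy_action is an assumption; the Q-Values are the rounded state-0 values printed earlier in these notes.

```python
import random


def epsilon_greedy_action(q_row, possible, epsilon, rng=random):
    """Pick an action for one state.

    q_row: Q-Values of the current state, indexed by action.
    possible: list of actions allowed in this state.
    With probability epsilon explore (uniform random allowed action),
    otherwise exploit (allowed action with the highest Q-Value).
    """
    if rng.random() < epsilon:
        return rng.choice(possible)
    return max(possible, key=lambda a: q_row[a])


rng = random.Random(0)  # seeded only so the demo is repeatable
q_row = [21.9, 20.8, 16.9]  # rounded Q-Values of state 0
picks = [epsilon_greedy_action(q_row, [0, 1, 2], 0.1, rng) for _ in range(1000)]
print("share of greedy picks:", picks.count(0) / len(picks))
```

Restricting both branches to the allowed actions keeps impossible actions (the -inf entries) from ever being selected.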


Approximate Q-Learning


• TODO


Ms Pac-Man with Deep Q-Learning 


env = gym.make('MsPacman-v0')
obs = env.reset()
obs.shape, env.action_space


[2017-04-27 13:06:21,861] Making new env: MsPacman-v0


((210, 160, 3), Discrete(9))


mspacman_color = np.array([210, 164, 74]).mean()


# crop image, shrink to 88x80 pixels, convert to greyscale, improve contrast

def preprocess_observation(obs):
    img = obs[1:176:2, ::2]  # crop and downsize
    img = img.mean(axis=2)  # to greyscale
    img[img == mspacman_color] = 0  # improve contrast
    img = (img - 128) / 128  # normalize from -1. to 1.
    return img.reshape(88, 80, 1)
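A quick sanity check of the sizes used in preprocess_observation, as a standard-library sketch. The helper names are hypothetical, and the normalization shown assumes the plain (pixel - 128) / 128 mapping that the "-1. to 1." comment describes.

```python
def sliced_length(start, stop, step):
    """Number of elements selected by the slice start:stop:step."""
    return len(range(start, stop, step))


rows = sliced_length(1, 176, 2)  # obs[1:176:2] on the 210-row axis
cols = sliced_length(0, 160, 2)  # obs[::2] on the 160-column axis
print(rows, cols)


def normalize(pixel):
    """Map a 0..255 grey level to roughly -1..1."""
    return (pixel - 128) / 128


print(normalize(0), normalize(128), normalize(255))
```

The crop keeps rows 1 to 175 and every second pixel in both directions, which is why the reshape to (88, 80, 1) succeeds.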


[Figure: Ms. Pac-Man observation & Deep Q-network. Original observation (160x210 RGB) -> preprocessed observation (88x80 greyscale) = input state (88x80x1) -> convolutional layers (32 maps of 8x8, stride 4; 64 maps of 4x4, stride 2; final shape 11x10x64) -> fully connected, 512 units -> fully connected, 9 units -> output Q-Values.]


# Create DQN
# 3 convo layers, then 2 FC layers including output layer

from tensorflow.contrib.layers import convolution2d, fully_connected

input_height = 88
input_width = 80
input_channels = 1
conv_n_maps = [32, 64, 64]
conv_kernel_sizes = [(8,8), (4,4), (3,3)]
conv_strides = [4, 2, 1]
conv_paddings = ["SAME"]*3
conv_activation = [tf.nn.relu]*3
n_hidden_in = 64 * 11 * 10  # conv3 has 64 maps of 11x10 each
n_hidden = 512
hidden_activation = tf.nn.relu
n_outputs = env.action_space.n  # 9 discrete actions are available
initializer = tf.contrib.layers.variance_scaling_initializer()
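Where n_hidden_in = 64 * 11 * 10 comes from, as a small sketch. With "SAME" padding the output length of a convolution is the input length divided by the stride, rounded up; the helper name same_padding_output is hypothetical.

```python
import math


def same_padding_output(size, stride):
    """Output length of a convolution with "SAME" padding:
    ceil(input length / stride), independent of kernel size."""
    return math.ceil(size / stride)


height, width = 88, 80
for stride in [4, 2, 1]:  # conv_strides
    height = same_padding_output(height, stride)
    width = same_padding_output(width, stride)
    print(height, width)

n_maps_last = 64
print(n_maps_last * height * width)  # 7040
```

The feature maps shrink from 88x80 to 22x20, then to 11x10, and the last layer (stride 1) leaves that size unchanged.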


# training will need ***TWO*** DQNs:
# one to play the game during training (the actor)
# another to learn from the actor's trials & errors (the critic)
# q_network is our net builder.

def q_network(X_state, scope):
    prev_layer = X_state
    conv_layers = []

    with tf.variable_scope(scope) as scope:

        for n_maps, kernel_size, stride, padding, activation in zip(
                conv_n_maps,
                conv_kernel_sizes,
                conv_strides,
                conv_paddings,
                conv_activation):

            prev_layer = convolution2d(
                prev_layer,
                num_outputs=n_maps,
                kernel_size=kernel_size,
                stride=stride,
                padding=padding,
                activation_fn=activation,
                weights_initializer=initializer)

            conv_layers.append(prev_layer)

        last_conv_layer_flat = tf.reshape(
            prev_layer,
            shape=[-1, n_hidden_in])

        hidden = fully_connected(
            last_conv_layer_flat,
            n_hidden,
            activation_fn=hidden_activation,
            weights_initializer=initializer)

        outputs = fully_connected(
            hidden,
            n_outputs,
            activation_fn=None,
            weights_initializer=initializer)

    trainable_vars = tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES,
        scope=scope.name)

    trainable_vars_by_name = {var.name[len(scope.name):]: var
                              for var in trainable_vars}

    return outputs, trainable_vars_by_name


# create input placeholders & two DQNs

X_state = tf.placeholder(
    tf.float32,
    shape=[None, input_height, input_width,
           input_channels])

actor_q_values, actor_vars = q_network(X_state, scope="q_networks/actor")
critic_q_values, critic_vars = q_network(X_state, scope="q_networks/critic")

# op to copy all trainable vars of critic DQN to actor DQN...
copy_ops = [actor_var.assign(critic_vars[var_name])
            for var_name, actor_var in actor_vars.items()]

# use tf.group() to group all assignment ops together
copy_critic_to_actor = tf.group(*copy_ops)


# Critic DQN learns by matching its Q-Value predictions
# to the actor's Q-Value estimations during game play

# Actor will use a "replay memory" (5-tuples):
# state, action, reward, next state, continue (0=over/1=continue)

# use normal supervised training ops
# occasionally copy critic DQN to actor DQN

# DQN normally returns one Q-Value for every poss. action
# only need Q-Value of action actually chosen
# So, convert action to one-hot vector [0...1...0], multiply by Q-Values
# then sum over 1st axis.

X_action = tf.placeholder(
    tf.int32, shape=[None])

q_value = tf.reduce_sum(
    critic_q_values * tf.one_hot(X_action, n_outputs),
    axis=1, keep_dims=True)


# Do not call tf.reset_default_graph() here: y must be created in the same
# graph as q_value. Otherwise tf.square(y - q_value) raises
# ValueError: Tensor("Sum_1:0", shape=(?, 1), dtype=float32) must be from the
# same graph as Tensor("Placeholder:0", shape=(?, 1), dtype=float32).

y = tf.placeholder(
    tf.float32, shape=[None, 1])

cost = tf.reduce_mean(
    tf.square(y - q_value))

# non-trainable. minimize() op will manage incrementing it
global_step = tf.Variable(
    0,
    trainable=False,
    name='global_step')

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(cost, global_step=global_step)

init = tf.global_variables_initializer()

saver = tf.train.Saver()
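The one-hot trick used for q_value, shown on a toy batch in plain Python. The helper names one_hot and select_q_values and the Q-Values are made up for illustration; the result keeps one column per row, like keep_dims=True.

```python
def one_hot(index, depth):
    """Return a list of `depth` zeros with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def select_q_values(q_values, actions):
    """For each row of q_values keep only the Q-Value of the action taken:
    multiply by the one-hot mask, then sum along the action axis."""
    n_actions = len(q_values[0])
    selected = []
    for row, action in zip(q_values, actions):
        mask = one_hot(action, n_actions)
        selected.append([sum(q * m for q, m in zip(row, mask))])
    return selected


# Toy batch: 2 states, 3 possible actions (made-up Q-Values).
q_batch = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
print(select_q_values(q_batch, actions=[2, 0]))  # [[3.0], [4.0]]
```

Multiplying by the mask zeroes every Q-Value except the chosen one, so the sum simply picks it out while staying differentiable in the TF graph.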


# use a deque list to build the replay memory

from collections import deque

replay_memory_size = 10000
replay_memory = deque(
    [], maxlen=replay_memory_size)


def sample_memories(batch_size):
    indices = rnd.permutation(
        len(replay_memory))[:batch_size]
    cols = [[], [], [], [], []]  # state, action, reward, next_state, continue

    for idx in indices:
        memory = replay_memory[idx]
        for col, value in zip(cols, memory):
            col.append(value)

    cols = [np.array(col) for col in cols]
    return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3],
            cols[4].reshape(-1, 1))


# create an actor
# use epsilon-greedy policy
# gradually decrease epsilon from 1.0 to 0.05 across 50K training steps

eps_min = 0.05
eps_max = 1.0
eps_decay_steps = 50000


def epsilon_greedy(q_values, step):
    epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
    if rnd.rand() < epsilon:
        return rnd.randint(n_outputs)  # random action
    else:
        return np.argmax(q_values)  # optimal action
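To see what the decay schedule does, here is the epsilon formula on its own, evaluated at a few training steps. The function name epsilon_at is an assumption; the constants are the ones defined above.

```python
eps_min = 0.05
eps_max = 1.0
eps_decay_steps = 50000


def epsilon_at(step):
    """Linear decay from eps_max to eps_min over eps_decay_steps,
    then constant at eps_min."""
    return max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)


for step in [0, 25000, 50000, 100000]:
    print(step, round(epsilon_at(step), 3))
```

Early on almost every move is random; after 50K training steps only about one move in twenty is.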


# training setup: the variables

n_steps = 100000  # total number of training steps
training_start = 1000  # start training after 1,000 game iterations
training_interval = 3  # run a training step every 3 game iterations
save_steps = 50  # save the model every 50 training steps
copy_steps = 25  # copy the critic to the actor every 25 training steps
discount_rate = 0.95
skip_start = 90  # skip the start of every game (it's just waiting time)
batch_size = 50
iteration = 0  # game iterations
checkpoint_path = "./my_dqn.ckpt"
done = True  # env needs to be reset


# let's get busy
import os

with tf.Session() as sess:

    # restore models if checkpoint file exists
    if os.path.isfile(checkpoint_path):
        saver.restore(sess, checkpoint_path)

    # otherwise normally initialize variables
    else:
        init.run()

    while True:
        step = global_step.eval()
        if step >= n_steps:
            break

        # iteration = total number of game steps from beginning
        iteration += 1
        if done:  # game over, start again
            obs = env.reset()

            for skip in range(skip_start):  # skip the start of each game
                obs, reward, done, info = env.step(0)
            state = preprocess_observation(obs)

        # Actor evaluates what to do
        q_values = actor_q_values.eval(feed_dict={X_state: [state]})
        action = epsilon_greedy(q_values, step)

        # Actor plays
        obs, reward, done, info = env.step(action)
        next_state = preprocess_observation(obs)

        # Let's memorize what just happened
        replay_memory.append((state, action, reward, next_state,
                              1.0 - done))
        state = next_state

        if iteration < training_start or iteration % training_interval != 0:
            continue

        # Critic learns
        X_state_val, X_action_val, rewards, X_next_state_val, continues = (
            sample_memories(batch_size))

        next_q_values = actor_q_values.eval(
            feed_dict={X_state: X_next_state_val})

        max_next_q_values = np.max(
            next_q_values, axis=1, keepdims=True)

        y_val = rewards + continues * discount_rate * max_next_q_values

        training_op.run(
            feed_dict={X_state: X_state_val, X_action: X_action_val, y: y_val})

        # Regularly copy critic to actor
        if step % copy_steps == 0:
            copy_critic_to_actor.run()

        # And save regularly
        if step % save_steps == 0:
            saver.save(sess, checkpoint_path)
            print("\n", np.average(y_val))
1.09000234097
1.35392784142
1.56906713688
...
5.96050115204
5.97032629395


KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-44-d0da605267f3> in <module>()
     46
     47         next_q_values = actor_q_values.eval(
---> 48             feed_dict={X_state: X_next_state_val})
     49
     50         max_next_q_values = np.max(

KeyboardInterrupt:
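The critic's training target from the loop above, y_val = rewards + continues * discount_rate * max_next_q_values, worked through on a toy batch in plain Python. The function name q_learning_targets and all numbers are made up for illustration.

```python
def q_learning_targets(rewards, continues, max_next_q_values, discount_rate):
    """Training targets for the critic: the observed reward plus, when the
    game is not over (continue == 1.0), the discounted best next Q-Value."""
    return [r + c * discount_rate * q
            for r, c, q in zip(rewards, continues, max_next_q_values)]


# Toy batch of 3 memories (made-up numbers); the last one ended the game.
targets = q_learning_targets(
    rewards=[10.0, 0.0, -5.0],
    continues=[1.0, 1.0, 0.0],
    max_next_q_values=[2.0, 4.0, 100.0],
    discount_rate=0.95)
print(targets)
```

The continue flag zeroes the bootstrap term at the end of a game, so a terminal step's target is just its reward, whatever the network predicts for the next state.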


